Definition
Pipeline parallelism, also called vertical or inter-layer parallelism, splits a model by depth. The layers are divided into consecutive stages, and each stage lives on a different accelerator. A batch flows through the stages like an assembly line: device one computes the first block of layers, hands its activations to device two, and so on. During the backward pass, gradients flow back along the same path in reverse.
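As a rough illustration of splitting by depth, the sketch below assigns layer indices to consecutive stages. The function name partition_layers and the even-split policy are assumptions made for this example; real frameworks often balance stages by compute or memory cost rather than by layer count.

```python
from __future__ import annotations


def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into consecutive stages.

    Hypothetical helper: stages get an equal number of layers, and any
    remainder is given to the earliest stages, one extra layer each.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    # Each stage would be placed on its own accelerator.
    for device, layers in enumerate(partition_layers(10, 4)):
        print(f"device {device}: layers {list(layers)}")
```

With 10 layers and 4 stages, the first two devices hold three layers each and the last two hold two each.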
Keeping the pipeline busy
The obvious problem is idle hardware. While device one works on the first stage, the later devices have nothing to do, and during the backward pass the earlier devices wait in turn. This idle time is called the pipeline bubble. The standard fix, introduced by GPipe, is to split each batch into smaller micro-batches and feed them in one after another, so that several stages work on different micro-batches at once. The 1F1B (one forward, one backward) schedule interleaves forward and backward passes, which mainly lowers peak activation memory; its interleaved variants, which assign several smaller stages to each device, reduce the bubble further.
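To make the effect of micro-batching concrete, the sketch below models a simple GPipe-style schedule under the assumption that every stage takes the same time per micro-batch. Under that assumption, with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The function names are illustrative and do not come from any particular library.

```python
from __future__ import annotations

from fractions import Fraction


def bubble_fraction(num_stages: int, num_microbatches: int) -> Fraction:
    """Idle fraction of a step, assuming equal-cost stages and micro-batches."""
    p, m = num_stages, num_microbatches
    return Fraction(p - 1, m + p - 1)


def forward_schedule(num_stages: int, num_microbatches: int) -> list[list[int | None]]:
    """Forward-pass timeline: grid[t][s] is the micro-batch that stage s
    processes at time step t, or None if the stage is idle (in the bubble)."""
    steps = num_microbatches + num_stages - 1
    grid = []
    for t in range(steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # micro-batch mb reaches stage s at step mb + s
            row.append(mb if 0 <= mb < num_microbatches else None)
        grid.append(row)
    return grid


if __name__ == "__main__":
    for m in (1, 8, 32):
        print(f"4 stages, {m} micro-batches: bubble = {bubble_fraction(4, m)}")
    for t, row in enumerate(forward_schedule(3, 2)):
        print(t, row)
```

With 4 stages, a single micro-batch leaves 3/4 of the step idle, while 8 micro-batches bring that down to 3/11 and 32 to 3/35, which is why the number of micro-batches is usually chosen to be several times the number of stages.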
Communication profile
Pipeline parallelism only needs to pass intermediate activations (and, in the backward pass, their gradients) between adjacent stages, so its communication cost is low compared to tensor parallelism, which requires collective communication inside every layer. That makes it well suited to spanning nodes, where inter-node bandwidth is more limited, complementing tensor parallelism inside each node.
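A back-of-the-envelope sketch of the traffic at one stage boundary follows. It assumes a transformer-style activation of shape (micro-batch size, sequence length, hidden size); the helper name and the example dimensions are hypothetical and chosen only for illustration.

```python
def activation_bytes(micro_batch: int, seq_len: int, hidden: int, bytes_per_elem: int) -> int:
    """Bytes sent across one stage boundary for one micro-batch in the
    forward pass, assuming a (micro_batch, seq_len, hidden) activation.
    The backward pass sends a gradient of the same shape the other way."""
    return micro_batch * seq_len * hidden * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical sizes: 4 sequences of 2048 tokens, hidden size 4096, 16-bit values.
    size = activation_bytes(micro_batch=4, seq_len=2048, hidden=4096, bytes_per_elem=2)
    print(f"{size / 2**20:.0f} MiB per micro-batch per boundary")
```

The point of the estimate is that this transfer happens once per micro-batch per boundary, regardless of how many layers each stage contains.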
Pipeline parallelism is one of the three axes of 3D parallelism, alongside data and tensor parallelism. See Tensor Parallelism for splitting work inside a layer, and Gradient Accumulation, which shares the micro-batch mechanics used to shrink the pipeline bubble.
