Definition
Tensor parallelism, also known as horizontal or intra-layer parallelism, divides the math inside a single layer across multiple accelerators. Instead of giving each device a copy of the whole model, the large weight matrices of a layer are sliced column-wise or row-wise, each device computes its partial result, and the pieces are combined. This lets a single layer that would never fit on one accelerator run across several.
How the work is split
In a transformer, the attention and feed-forward matrix multiplications are the natural targets. A weight matrix is partitioned so each device multiplies its slice against the activations, then an all-reduce or all-gather stitches the outputs back together before the next layer. Because this synchronization happens inside every layer, tensor parallelism carries the highest communication overhead of the common strategies and is most effective within a single high-bandwidth node where devices are tightly interconnected.
Where it fits
Tensor parallelism rarely stands alone at scale. It is typically combined with pipeline parallelism (splitting layers across devices) and data parallelism (replicating the whole stack) into what practitioners call 3D parallelism, the layout used to train the largest models. The rule of thumb is to keep tensor parallelism inside a node and use the lower-bandwidth strategies across nodes.
Tensor parallelism is one of the three core axes of distributed training. Compare it with Pipeline Parallelism, which splits the model by layer, and Data Parallelism, which replicates the model and splits the data.
In Simple Terms
Tensor parallelism, also known as horizontal or intra-layer parallelism, divides the math inside a single layer across multiple accelerators. Instead of giving each device a…
