Definition
Expert parallelism is a parallelization strategy built specifically for Mixture-of-Experts (MoE) architectures. An MoE layer contains many small sub-networks called experts, but each token is routed to only a few of them. Expert parallelism distributes those experts across multiple accelerators, so no single device has to hold them all, and dynamically sends each token to whichever device hosts the expert it was assigned.
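The placement idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sizes, the contiguous block placement, the top-2 routing, and the helper names `device_for_expert` and `top_k_experts` are all hypothetical and not taken from any particular framework.

```python
from typing import List

# Hypothetical sizes chosen only for illustration.
NUM_EXPERTS = 8
NUM_DEVICES = 4
EXPERTS_PER_DEVICE = NUM_EXPERTS // NUM_DEVICES  # 2 experts hosted per device


def device_for_expert(expert_id: int) -> int:
    """Device hosting an expert, assuming contiguous block placement."""
    return expert_id // EXPERTS_PER_DEVICE


def top_k_experts(gate_scores: List[float], k: int = 2) -> List[int]:
    """Pick the k highest-scoring experts for one token (ties go to the lower index)."""
    ranked = sorted(range(len(gate_scores)), key=lambda e: (-gate_scores[e], e))
    return ranked[:k]


# Made-up gate scores for a single token over the 8 experts.
scores = [0.05, 0.40, 0.02, 0.10, 0.03, 0.30, 0.06, 0.04]
chosen = top_k_experts(scores)                            # experts 1 and 5
target_devices = [device_for_expert(e) for e in chosen]   # devices 0 and 2
```

Here the token only needs to visit two of the four devices, and no device stores more than two of the eight experts.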
Routing and all-to-all
The defining communication pattern is the all-to-all collective. Within each MoE layer, a gating network first decides which experts each token is routed to, and an all-to-all dispatch then shuffles tokens across devices so they land where their experts live. After the experts compute, a second all-to-all (the combine step) returns the results to the devices and positions the tokens came from. This dispatch/combine pair is typically the main performance bottleneck of expert parallelism, and specialized token dispatchers are used to scale routing efficiently to dozens of expert-parallel devices.
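The dispatch/combine round trip can be simulated in a single process to make the data movement concrete. The sketch below is a toy model, not a real distributed implementation: devices are plain Python lists, routing is top-1 for brevity, experts are stand-in functions that scale their input, and the names `all_to_all` and `moe_forward` are hypothetical.

```python
from typing import Callable, List


def all_to_all(send: List[List[list]]) -> List[List[list]]:
    """Simulated collective: recv[dst][src] is the buffer src put in send[src][dst]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def moe_forward(tokens: List[List[float]],
                assignments: List[List[int]],
                experts: List[Callable[[float], float]],
                experts_per_device: int) -> List[List[float]]:
    """tokens[d][i] is the i-th token on device d; assignments[d][i] is its expert id."""
    n = len(tokens)

    # Dispatch: every device buckets its tokens by destination device and
    # records each token's local position so the result can be put back.
    send = [[[] for _ in range(n)] for _ in range(n)]
    for src in range(n):
        for pos, (x, e) in enumerate(zip(tokens[src], assignments[src])):
            send[src][e // experts_per_device].append((pos, e, x))
    recv = all_to_all(send)

    # Expert compute: each device only evaluates experts it hosts.
    results = [[[] for _ in range(n)] for _ in range(n)]
    for dst in range(n):
        for src in range(n):
            results[dst][src] = [(pos, experts[e](x)) for pos, e, x in recv[dst][src]]

    # Combine: the second all-to-all sends outputs back to the source devices,
    # where they are written into their original positions.
    back = all_to_all(results)
    out = [[0.0] * len(tokens[d]) for d in range(n)]
    for src in range(n):
        for bucket in back[src]:
            for pos, y in bucket:
                out[src][pos] = y
    return out


# Toy setup: 2 devices, 4 experts (2 per device); expert e multiplies by 10**e.
experts = [lambda x, s=10 ** e: x * s for e in range(4)]
tokens = [[1, 2, 3], [4, 5]]
assignments = [[0, 3, 2], [1, 2]]
out = moe_forward(tokens, assignments, experts, experts_per_device=2)
# out == [[1, 2000, 300], [40, 500]]
```

In a real system each `all_to_all` call would be a collective over the expert-parallel group, and the per-destination buckets would be packed into contiguous buffers with known sizes; that packing is the part specialized token dispatchers optimize.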
Load balancing
Because routing is data-dependent, some experts can receive far more tokens than others, leaving the devices that host them overloaded while others sit idle. MoE training therefore relies on auxiliary load-balancing objectives and per-expert capacity limits to keep work spread evenly. Expert parallelism is usually combined with data, tensor, and pipeline parallelism to train the largest sparse models.
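Both mechanisms can be illustrated with a small sketch. The text does not specify a particular formulation, so the following assumes one common choice: a capacity of ceil(capacity_factor × tokens / experts) with overflow tokens dropped in arrival order, and an auxiliary loss of the form N · Σ f_e · p_e in the style of the Switch Transformer, where f_e is the fraction of tokens sent to expert e and p_e is the mean router probability for e. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Maximum number of tokens each expert accepts (assumed formula)."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


def apply_capacity(assignments: List[int], num_experts: int,
                   capacity: int) -> Tuple[List[int], List[int]]:
    """Admit tokens in arrival order until an expert is full; return (kept, dropped) indices."""
    load = [0] * num_experts
    kept, dropped = [], []
    for tok, e in enumerate(assignments):
        if load[e] < capacity:
            load[e] += 1
            kept.append(tok)
        else:
            dropped.append(tok)
    return kept, dropped


def balance_loss(router_probs: List[List[float]], assignments: List[int]) -> float:
    """Assumed auxiliary loss N * sum_e f_e * p_e; equals 1.0 for perfectly uniform routing."""
    num_tokens = len(assignments)
    num_experts = len(router_probs[0])
    f = [assignments.count(e) / num_tokens for e in range(num_experts)]
    p = [sum(row[e] for row in router_probs) / num_tokens for e in range(num_experts)]
    return num_experts * sum(fe * pe for fe, pe in zip(f, p))


# Skewed routing: 8 tokens, 4 experts, most tokens pick expert 0.
assignments = [0, 0, 0, 0, 1, 0, 2, 0]
capacity = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.0)  # 2
kept, dropped = apply_capacity(assignments, num_experts=4, capacity=capacity)
# kept == [0, 1, 4, 6]; dropped == [2, 3, 5, 7]

uniform = balance_loss([[0.25] * 4] * 8, [0, 1, 2, 3, 0, 1, 2, 3])   # 1.0
skewed = balance_loss([[0.7, 0.1, 0.1, 0.1]] * 8, [0] * 8)           # about 2.8
```

The capacity limit bounds the worst-case work per device at the cost of dropping overflow tokens, while the auxiliary loss pushes the router toward the uniform case so that fewer tokens overflow in the first place.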
Expert parallelism is the systems counterpart to the Mixture-of-Experts model design. It sits alongside the core axes of Data Parallelism, Tensor Parallelism, and Pipeline Parallelism in a full distributed-training stack.
