Definition
Mixture of Tokens is a Transformer architecture, presented at NeurIPS 2024, designed as a more stable alternative to the sparse Mixture of Experts (MoE) approach used to scale large language models. Like MoE, it lets a model carry a very large parameter count without a proportional increase in computation. Unlike MoE, it avoids making discrete routing decisions during training, which removes the instability and load-imbalance problems that plague sparse routing.
How it differs from Mixture of Experts
In a standard MoE, a router sends each token to a small number of specialized expert sub-networks. This is a hard, discrete choice that can be unstable to train and can leave some experts overloaded and others idle. Mixture of Tokens instead has each expert process a weighted average (a soft mixture) of tokens drawn from across examples in the batch; the expert's output is then distributed back to the contributing tokens according to the same mixing weights. Because the mixing is a continuous, differentiable operation with no hard selection, training behaves much like that of a vanilla Transformer. Every expert also processes the same number of mixed inputs regardless of what the weights are, so load imbalance disappears by construction.
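The mixing step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mixture_of_tokens`, the linear `controller` that scores tokens, the two-layer ReLU experts, the reuse of the mixing weights for redistribution, and all shapes are hypothetical choices made here for clarity.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mixture_of_tokens(tokens, controller, w_in, w_out):
    """Soft token mixing for one group of tokens (illustrative sketch).

    tokens:     (group_size, d_model), one token from each of several examples
    controller: (d_model, n_experts), assumed linear scorer (hypothetical)
    w_in:       (n_experts, d_model, d_ff), first layer of each expert
    w_out:      (n_experts, d_ff, d_model), second layer of each expert
    """
    # Per-expert importance weights, normalized over the tokens in the group.
    weights = softmax(tokens @ controller, axis=0)   # (group_size, n_experts)

    # Each expert receives exactly one mixed token: a weighted average of the group.
    mixed = weights.T @ tokens                       # (n_experts, d_model)

    # Every expert applies its own feed-forward network to its mixed token.
    hidden = np.maximum(0.0, np.einsum("ed,edf->ef", mixed, w_in))
    expert_out = np.einsum("ef,efd->ed", hidden, w_out)  # (n_experts, d_model)

    # Assumption: outputs are redistributed to tokens using the same weights.
    return weights @ expert_out                      # (group_size, d_model)


rng = np.random.default_rng(0)
group_size, d_model, d_ff, n_experts = 4, 8, 16, 3
tokens = rng.normal(size=(group_size, d_model))
controller = rng.normal(size=(d_model, n_experts))
w_in = rng.normal(size=(n_experts, d_model, d_ff))
w_out = rng.normal(size=(n_experts, d_ff, d_model))

out = mixture_of_tokens(tokens, controller, w_in, w_out)
print(out.shape)  # (4, 8)
```

Every operation in the sketch is a matrix product, a softmax, or a ReLU, so gradients flow to the controller and to all experts on every step. Each expert also handles exactly one mixed input per group whatever the weights turn out to be, which is the sense in which load is balanced by construction.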
Why it matters
The reported result is quality competitive with a vanilla Transformer at a several-fold reduction in training compute and a meaningful wall-clock speedup, while remaining compatible with standard masked and causal language-model training. Efficiency techniques like this are central to making capable models cheaper to train and run. That directly affects whether strong open-weight models become small and efficient enough to self-host on accessible hardware, rather than remaining locked behind data-center-scale infrastructure.
Mixture of Tokens is one of several efficiency strategies behind modern foundation models, and the compute savings it targets echo the cost concerns addressed by Chinchilla-optimal training.
