Definition
Mixture of Experts (MoE) is a neural network architecture that splits a model's feed-forward layers into many specialized sub-networks called experts, then uses a small router to decide which experts handle each input token. Rather than passing every token through the entire model, an MoE activates only a handful of experts at a time, giving the model a huge total parameter count while keeping the compute per token low.
The Router and Sparse Activation
At the heart of an MoE is the gating network, or router. For each token it produces a score over all experts and selects the top-k, usually just one or two, to actually run. This is called sparse activation: a model might hold hundreds of billions of total parameters, but only a small subset, the active parameters, fire for any given token. A common modern pattern is one always-on shared expert plus several routed experts, which improves stability.
Why It Matters for Local Inference
MoE changes the hardware math in a way that favors self-hosting. The full model must fit in memory, so you still need ample RAM or VRAM for all the experts, but the per-token compute, and therefore the inference speed, tracks only the active parameters. This lets a large, capable MoE model run surprisingly fast on hardware that could never run a dense model of the same total size.
MoE explains why some models report a large total parameter count but a much smaller active count. To gauge real-world speed on your gear, pair this with Tokens per Second and Local LLM.
In Simple Terms
Mixture of Experts (MoE) is a neural network architecture that splits a model’s feed-forward layers into many specialized sub-networks called experts, then uses a small…
