Mixture of Experts (MoE)


Mixture of Experts (MoE) is a neural network architecture that splits a model's feed-forward layers into many specialized sub-networks called experts, with a small learned router deciding which experts process each token. Instead of pushing every token through the entire network the way a dense model does, an MoE activates only a handful of experts per token — so the model can carry an enormous total parameter count while spending only a fraction of it in compute for any given input. It is the architecture behind most of the recent open models whose spec sheets read like a riddle: hundreds of billions of parameters "total," a small fraction "active."
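The "total" versus "active" split can be made concrete with a little arithmetic. The sketch below assumes a hypothetical layout (the `MoEConfig` fields and every number in it are invented for illustration, and the router's own small weight matrix is ignored); it is not the configuration of any real model.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical layer layout; every number used below is illustrative."""
    n_layers: int            # transformer blocks, each with one MoE layer
    n_experts: int           # routed experts per MoE layer
    top_k: int               # experts activated per token
    params_per_expert: int   # weights in one expert's feed-forward network
    dense_params: int        # attention, embeddings, etc. used by every token


def total_params(cfg: MoEConfig) -> int:
    """Everything that has to be stored."""
    return cfg.dense_params + cfg.n_layers * cfg.n_experts * cfg.params_per_expert


def active_params(cfg: MoEConfig) -> int:
    """What a single token actually passes through."""
    return cfg.dense_params + cfg.n_layers * cfg.top_k * cfg.params_per_expert


cfg = MoEConfig(n_layers=32, n_experts=64, top_k=4,
                params_per_expert=45_000_000, dense_params=8_000_000_000)
print(f"total:  {total_params(cfg) / 1e9:.1f}B")
print(f"active: {active_params(cfg) / 1e9:.1f}B")
```

With these made-up numbers the stored model is roughly seven times larger than what any one token touches, which is the kind of gap MoE spec sheets describe.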

The router and sparse activation

At the heart of every MoE layer sits the gating network, or router: a lightweight function that scores all experts for the current token and selects the top-k to actually run. Early designs typically chose one or two experts per token; some recent models with many small experts choose more. The token's representation is processed by those experts only, their outputs are blended using the router's weights, and the rest of the layer's parameters sit idle. This is sparse activation, and the distinction it creates between total parameters (everything stored) and active parameters (what fires per token) is the key to reading MoE model cards. Training such routers has its own folklore: auxiliary load-balancing losses keep the router from collapsing onto favorite experts, and a common modern pattern adds one always-on shared expert alongside the routed ones to stabilize learning. The specialization that emerges is statistical rather than neatly human-readable: experts tend to attract particular token patterns and loose domains rather than tidy, nameable subjects.
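The routing step described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not any particular model's implementation: the names `route` and `moe_layer` are made up here, the "experts" are trivial scaling functions, and the gate weights are a softmax over the selected experts only, which is one convention among several (others normalise over all experts before picking the top-k).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, k):
    """Score every expert for one token and keep the k highest.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # renormalise over the winners only
    return list(zip(top, gates))


def moe_layer(token, router_weights, experts, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, gate in route(token, router_weights, k):
        expert_out = experts[idx](token)  # unselected experts are never called
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: 4 "experts" that just scale their input, 3-dimensional tokens.
random.seed(0)
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
token = [0.5, -1.0, 2.0]

chosen = route(token, router_weights, k=2)
print(chosen)  # two (expert index, gate weight) pairs
print(moe_layer(token, router_weights, experts))
```

Only the experts returned by `route` are ever called, which is the whole point: the other experts cost memory but no compute for this token.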

Why MoE changes the hardware math

For self-hosters, MoE redraws the trade-off between capability and speed, and it does so in a lopsided way. Memory: the full model must be resident, because any expert may be summoned at any token, so RAM or VRAM requirements track total parameters. Compute: per-token work, and therefore inference speed, tracks only the active parameters. The result is a model that demands the memory of a giant but runs with the latency of something several times smaller. This lands especially well on machines with plentiful system RAM or unified memory but modest raw compute, and engines like llama.cpp exploit it: with quantization shrinking the stored experts and the option to keep expert weights in system RAM while the rest of the model runs on a GPU, MoE checkpoints generate tokens far faster than a dense model of equal total size could on the same box.
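A back-of-the-envelope version of this trade-off is sketched below. It rests on a simplifying assumption that is not from the text above: that token generation is limited by memory bandwidth, so each new token costs roughly one read of the active weights. The function names and all input numbers are illustrative, and the KV cache, activations, and runtime overhead are ignored.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold every expert, in GB (10**9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def decode_speed_ceiling(active_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Rough upper bound on tokens/second if decoding is bandwidth-bound.

    Each generated token has to read the active weights once, so the
    ceiling is bandwidth divided by the bytes touched per token.
    """
    gb_per_token = active_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / gb_per_token


# Illustrative inputs only: a 100B-total / 10B-active model quantized to
# 4 bits per weight, on a machine with 200 GB/s of memory bandwidth.
TOTAL, ACTIVE, BITS, BANDWIDTH = 100e9, 10e9, 4.0, 200.0

print(f"weights resident: {weight_memory_gb(TOTAL, BITS):.0f} GB")
print(f"MoE ceiling:      {decode_speed_ceiling(ACTIVE, BITS, BANDWIDTH):.0f} tok/s")
print(f"dense ceiling:    {decode_speed_ceiling(TOTAL, BITS, BANDWIDTH):.0f} tok/s")
```

Under these assumptions the memory bill is set by the total count while the speed ceiling is set by the active count, so the sparse model's ceiling is ten times that of a dense model of the same stored size. Real throughput will land below either ceiling.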

Reading the numbers honestly

Treat an MoE's capability as sitting between its active and total counts — closer to what a dense model of similar training compute achieves than to either headline figure. A "100B-total / 10B-active" model is neither a 100B dense equal nor a mere 10B model; benchmarks, not parameter counts, settle where it lands. When sizing hardware: budget memory for total parameters (after quantization), estimate speed from active parameters, and verify with tokens per second on your own workload — the only benchmark that has never lied to anyone.
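Measuring tokens per second on your own workload needs very little scaffolding. The sketch below times a `generate` callable, which is a placeholder for whatever invokes your inference engine; `fake_generate` is a stand-in so the example runs on its own, and neither name corresponds to a real library API.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str], int], prompt: str,
                      runs: int = 3) -> float:
    """Average generation throughput over several runs.

    `generate` must perform one full generation for the prompt and
    return the number of tokens it produced.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(prompt: str) -> int:
    """Stand-in engine: pretends to emit 50 tokens in about 50 ms."""
    time.sleep(0.05)
    return 50


rate = tokens_per_second(fake_generate, "Explain MoE routing.")
print(f"{rate:.0f} tok/s")
```

Because the timer wraps the whole call, the figure includes prompt processing as well as generation. Use prompts and output lengths that resemble your real workload, since both affect the result.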

Why it matters for sovereign AI

MoE is a large part of why frontier-adjacent capability keeps arriving on owned hardware. Sparse activation lets open-model labs scale total capacity without scaling per-token cost, and the resulting checkpoints — memory-hungry but compute-light — happen to fit the homelab profile better than dense giants ever did: RAM is cheap; FLOPs are not. For anyone building a local LLM stack around self-hosted inference, MoE models are frequently the capability ceiling of what a given machine can run — provided you did the memory arithmetic on total, not active, parameters before clicking download.

