Mixture of Depths

Mixture of Depths (MoD) is a transformer technique, introduced by Google DeepMind researchers in 2024, that spends compute unevenly across tokens instead of taxing every token equally. In a standard transformer, every token passes through every layer's self-attention and feed-forward computation whether or not it needs the work: the word "the" gets the same deep processing as the token deciding the answer to a question. MoD instead places a lightweight router in front of a layer; the router scores the incoming tokens and selects only the k highest-scoring ones to receive the full computation, while the remaining tokens skip the layer through the residual connection and pass along unchanged. Routing can be applied at every layer or only at some (the paper's strongest configuration routed every other block). The model learns, during training, which tokens deserve depth.

How the routing works

The crucial engineering choice is that the budget k is fixed in advance as a fraction of the sequence, so each routed layer processes a known, constant number of tokens. That gives the network a static computation graph with known tensor shapes, a real advantage for hardware efficiency, because GPUs and accelerators strongly prefer predictable, uniform workloads over conditional branches whose sizes vary at runtime. Earlier conditional-compute schemes often saved theoretical FLOPs but ran no faster in practice for exactly this reason. The router itself is tiny, a learned scoring function per routed layer, and training teaches it which tokens can coast: high-frequency filler and syntactically predictable tokens tend to pass through cheaply, while semantically loaded tokens get the deep treatment. The result is a dial the designer can set: either faster inference and training at equal quality, or higher quality at equal compute. The published work reports models that match baseline quality while using a fraction of the FLOPs per forward pass, and that step upwards of 50% faster during sampling.
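The routing step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `mod_layer`, `router_weights`, `block`, and `capacity` are hypothetical names, the router is reduced to a dot product, and scaling the block's update by the router score (so the router would receive gradient during training) is an assumption of this sketch.

```python
from typing import Callable, List

Vector = List[float]


def mod_layer(
    tokens: List[Vector],
    router_weights: Vector,
    block: Callable[[List[Vector]], List[Vector]],
    capacity: float,
) -> List[Vector]:
    """Run `block` on the k highest-scoring tokens; pass the rest through."""
    # k depends only on sequence length and a fixed capacity fraction,
    # so the size of the processed batch is known before scoring.
    k = max(1, int(len(tokens) * capacity))

    # Router: one scalar score per token (a plain dot product in this sketch).
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]

    # Pick the k highest scores, then restore original sequence order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    chosen = sorted(top)

    # Gather: the block always sees exactly k tokens (static shape).
    updates = block([tokens[i] for i in chosen])

    # Default for every token is the residual pass-through (unchanged copy).
    out = [list(tok) for tok in tokens]

    # Scatter: selected tokens get the score-scaled update added back.
    for i, update in zip(chosen, updates):
        out[i] = [x + scores[i] * u for x, u in zip(tokens[i], update)]
    return out


# Toy usage: four 2-d tokens, a stand-in "block" that doubles its input.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
double = lambda batch: [[2.0 * x for x in v] for v in batch]
result = mod_layer(tokens, [1.0, 1.0], double, capacity=0.25)
# k = 1, so only token 2 (score 4.0) is updated: result[2] == [18.0, 18.0]
```

The gather and scatter steps are what keep tensor shapes fixed: whatever the scores turn out to be, the expensive block receives exactly k rows.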

Relation to Mixture of Experts

MoD is distinct from the better-known Mixture of Experts (MoE), and the names invite confusion. MoE routes each token to one or a few of several specialized sub-networks (experts) of the same depth, so every token receives roughly the same compute and the router only chooses which parameters process it. MoD decides whether a token is processed by a layer at all, so the router chooses how much compute the token receives. The two are orthogonal and can be combined, a configuration the researchers dubbed MoDE, routing tokens both across experts and around layers. Together they represent the broader shift in model design away from brute-force uniform computation and toward learned allocation of effort, the same shift sparse attention makes along the sequence axis.
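The difference between the two routing decisions is easy to see side by side. The sketch below is illustrative only: `moe_assign` and `mod_assign` are hypothetical helper names, the MoE variant shown is the simplest top-1 case, and the scores are made-up numbers.

```python
from typing import List


def moe_assign(expert_scores: List[List[float]]) -> List[int]:
    """Top-1 MoE: every token is processed; the router picks WHICH expert."""
    return [max(range(len(row)), key=lambda e: row[e]) for row in expert_scores]


def mod_assign(token_scores: List[float], k: int) -> List[bool]:
    """MoD: the router picks WHETHER each token is processed (k are kept)."""
    order = sorted(range(len(token_scores)),
                   key=lambda i: token_scores[i], reverse=True)
    kept = set(order[:k])
    return [i in kept for i in range(len(token_scores))]


# Three tokens, two experts: each token gets an expert index.
experts = moe_assign([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])  # [1, 0, 1]

# Same three tokens, budget k = 1: only one token enters the layer.
processed = mod_assign([0.2, 0.9, 0.5], k=1)  # [False, True, False]
```

In the MoE case the output has one entry per token and none are skipped; in the MoD case the output is a mask, and the number of `True` entries is fixed by the budget.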

Why it matters for local AI

For sovereign users, techniques like MoD matter because they attack the cost structure of intelligence rather than the price tag of a subscription. Every FLOP saved per token widens the set of hardware that can run a capable model: the difference between needing a datacenter GPU and getting acceptable speed from a consumer card or a mini-PC in your workshop is exactly these architectural efficiencies compounding. Skipping compute for easy tokens speeds both the prefill phase, where a long prompt is ingested, and the token-by-token decode that follows. The underlying idea, match work to need, spend where it yields, is one any miner recognizes: a tuned machine allocates power only where it produces hashrate, and a well-designed model allocates depth only where it produces meaning. As open-weight models absorb these techniques, the efficiency frontier keeps moving toward hardware individuals actually own, and that trajectory, capable intelligence on sovereign silicon, is the one worth watching.
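A back-of-the-envelope estimate shows how the savings compound across a stack of layers. The function below is a rough sketch under simplifying assumptions: per-layer cost is treated as proportional to the number of tokens processed (which ignores router overhead and understates the savings in attention, whose cost grows faster than linearly in token count), and the layer counts and capacities are illustrative values, not measurements. `relative_compute` and its parameters are hypothetical names.

```python
def relative_compute(n_layers: int, routed_every: int, capacity: float) -> float:
    """Estimate block compute relative to a dense model of the same depth.

    Every `routed_every`-th layer processes only a `capacity` fraction of
    tokens; all other layers process every token.
    """
    routed = n_layers // routed_every   # layers that apply top-k routing
    dense = n_layers - routed           # layers that process all tokens
    return (dense + routed * capacity) / n_layers


# 24 layers, every other layer routed at 12.5% capacity.
print(relative_compute(24, routed_every=2, capacity=0.125))  # 0.5625

# 24 layers, every layer routed at 50% capacity.
print(relative_compute(24, routed_every=1, capacity=0.5))    # 0.5
```

Under these assumptions both settings land near half the dense model's block compute, which is the kind of margin that can move a model from one hardware class to another.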

The honest caveats: MoD models must be trained with routing from the start, so the technique cannot simply be bolted onto an existing checkpoint. Top-k selection over a whole sequence also depends on tokens that have not been generated yet, so autoregressive sampling needs a causal stand-in, such as a small learned predictor of whether a token would have been selected. And skipping layers interacts non-trivially with the caching machinery that serves autoregressive generation. Together these are part of why adoption in released open models has trailed the paper. Architectural efficiencies tend to arrive in waves (an idea is published, inference engines adapt, then model families absorb it), and MoD sits mid-pipeline in that process.
