Definition
Mixture of Depths (MoD) is a transformer technique, introduced by Google DeepMind, that spends compute unevenly across tokens. In a standard transformer every token passes through every layer's self-attention and feed-forward computation, whether or not it needs the work. MoD instead places a lightweight router at each routed layer. The router scores every token, and only the k highest-scoring tokens receive the full computation; the remaining tokens skip the layer via a residual shortcut. This lets the model concentrate effort where it matters and save it elsewhere.
How the routing works
Because the budget k is fixed in advance, MoD caps how many tokens any routed layer processes. The computation graph stays static and the tensor shapes are known ahead of time, a real advantage for hardware efficiency over conditional-compute schemes whose tensor sizes vary at runtime. The router is trained along with the rest of the model and learns which tokens to promote, so easy or low-information tokens can pass through cheaply while harder tokens get the deep treatment. The payoff can be taken either as faster inference at equal quality or as higher quality at equal compute.
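The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the published implementation: the names `router_scores` and `mod_layer`, the dot-product router, and the toy `block` function are all hypothetical, and scaling the block output by the router score is an assumption of this sketch about how the router stays in the computation path.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def router_scores(tokens: Sequence[Vector], w_router: Vector) -> List[float]:
    """One scalar score per token: a dot product with a router weight vector."""
    return [sum(x * w for x, w in zip(tok, w_router)) for tok in tokens]


def mod_layer(
    tokens: Sequence[Vector],
    w_router: Vector,
    block: Callable[[Vector], Vector],
    k: int,
) -> Tuple[List[Vector], List[int]]:
    """Apply `block` to the k highest-scoring tokens; pass the rest through.

    Returns the new token list and the indices that were processed.
    Exactly k tokens are processed, so the amount of work is fixed
    before the input is seen.
    """
    scores = router_scores(tokens, w_router)
    # Indices of the k highest scores, restored to sequence order.
    by_score = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = sorted(by_score[:k])
    selected = set(top)

    out: List[Vector] = []
    for i, tok in enumerate(tokens):
        if i in selected:
            update = block(tok)
            # Residual plus the block output, scaled by the router score
            # (an assumption of this sketch).
            out.append([x + scores[i] * u for x, u in zip(tok, update)])
        else:
            # Skipped token: the residual shortcut leaves it unchanged.
            out.append(list(tok))
    return out, top


# Toy usage: four 2-dimensional tokens, a budget of k = 2.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
w_router = [1.0, -0.5]                      # hypothetical learned weights
double = lambda tok: [2.0 * x for x in tok]  # stand-in for attention + MLP
new_tokens, processed = mod_layer(tokens, w_router, double, k=2)
print(processed)   # indices of the tokens that received the full computation
print(new_tokens)
```

In a real model the stand-in `block` would be self-attention plus a feed-forward network run over the selected tokens as one batch of exactly k rows, which is what keeps the tensor shapes fixed.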
Relation to Mixture of Experts
MoD is distinct from Mixture of Experts (MoE): MoE routes tokens to different specialized sub-networks of the same depth, while MoD decides whether a token is processed by a layer at all. The two can be combined. The underlying idea, matching work to need, echoes how a tuned miner allocates power only where it yields hashrate.
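The difference between the two routing decisions can be made concrete with a small sketch. The function names `moe_route` and `mod_route` are hypothetical, and the MoE side assumes the simplest variant, in which each token picks its single best-scoring expert; real MoE layers often use other schemes.

```python
from typing import List, Sequence


def moe_route(expert_scores: Sequence[Sequence[float]]) -> List[int]:
    """MoE-style decision: WHICH expert handles each token.

    Every token is processed by some expert (top-1 choice assumed here).
    """
    return [max(range(len(s)), key=lambda e: s[e]) for s in expert_scores]


def mod_route(scores: Sequence[float], k: int) -> List[bool]:
    """MoD-style decision: WHETHER each token is processed at all.

    Exactly k tokens are marked True; the others skip the layer.
    """
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(by_score[:k])
    return [i in keep for i in range(len(scores))]


# Three tokens, two experts: each token gets an expert index.
print(moe_route([[0.1, 0.9], [0.7, 0.3], [0.2, 0.8]]))

# Three tokens, budget k = 1: one token is processed, two skip the layer.
print(mod_route([0.4, 1.2, -0.3], k=1))
```

Combining the two amounts to making both decisions for a token: first whether it enters the layer, then which expert processes it.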
For sovereign Bitcoiners watching the efficiency frontier, MoD is one route to running capable models on leaner hardware. See the prefill phase and sparse attention for related efficiency methods.
