Multi-Head Latent Attention (MLA)

Sovereign AI

Definition

Multi-Head Latent Attention (MLA) is an attention mechanism introduced in DeepSeek-V2 and carried into DeepSeek-V3 to attack the largest memory cost of long-context inference: the key-value (KV) cache, which holds the keys and values of every previous token. Where Grouped-Query and Multi-Query Attention shrink that cache by making heads share key/value projections, MLA takes a different route. It projects the keys and values jointly into a single low-rank latent vector per token and caches only that compact latent, reconstructing the full per-head keys and values on the fly during attention.
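
To make the contrast between these schemes concrete, the sketch below counts how many values each one caches per token in a single layer. The function name and the dimensions (128 heads of width 128, a 512-wide latent and a 64-wide positional key) are illustrative assumptions, chosen to resemble published DeepSeek configurations rather than taken from this entry.

```python
def kv_elements_per_token(scheme, n_heads, head_dim, n_kv_groups=1,
                          latent_dim=0, rope_dim=0):
    """Cached elements per token, per layer, for one attention scheme."""
    if scheme == "mha":   # every head keeps its own key and value
        return 2 * n_heads * head_dim
    if scheme == "gqa":   # heads share keys/values within n_kv_groups groups
        return 2 * n_kv_groups * head_dim
    if scheme == "mqa":   # all heads share one key and one value
        return 2 * head_dim
    if scheme == "mla":   # one joint latent plus a small positional key
        return latent_dim + rope_dim
    raise ValueError(f"unknown scheme: {scheme}")


# Illustrative dimensions (assumptions, not values stated in this entry).
N_HEADS, HEAD_DIM = 128, 128

sizes = {
    "mha": kv_elements_per_token("mha", N_HEADS, HEAD_DIM),
    "gqa": kv_elements_per_token("gqa", N_HEADS, HEAD_DIM, n_kv_groups=8),
    "mqa": kv_elements_per_token("mqa", N_HEADS, HEAD_DIM),
    "mla": kv_elements_per_token("mla", N_HEADS, HEAD_DIM,
                                 latent_dim=512, rope_dim=64),
}

for name, n in sizes.items():
    print(f"{name}: {n} cached elements per token per layer")
```

Under these assumed sizes, MLA caches less than an 8-group GQA layer while still giving every head its own reconstructed keys and values.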

Low-rank joint compression

The core trick is that the model never stores full-width keys and values at all. It learns a down-projection into a shared latent space and matching up-projections back out to keys and values. Only the latent is written to the cache, so the memory footprint per token shrinks sharply. DeepSeek reports that DeepSeek-V3 stores roughly 70 KB per token, versus several hundred kilobytes for comparable grouped-query models, a several-fold reduction. Because rotary positional embeddings depend on each token's position and cannot simply be folded into the low-rank projections, MLA carries a small separate "decoupled" positional key, cached alongside the latent, to preserve position information.
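
The down-project, cache, up-project cycle described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not DeepSeek's implementation: the matrix names (`W_dkv`, `W_uk`, `W_uv`, `W_q`) and all sizes are hypothetical, the weights are random rather than learned, and the decoupled rotary component is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration only).
d_model, n_heads, head_dim, latent_dim = 64, 4, 16, 8

# Random stand-ins for learned projections.
W_dkv = rng.standard_normal((d_model, latent_dim)) / np.sqrt(d_model)
W_uk = rng.standard_normal((latent_dim, n_heads * head_dim)) / np.sqrt(latent_dim)
W_uv = rng.standard_normal((latent_dim, n_heads * head_dim)) / np.sqrt(latent_dim)
W_q = rng.standard_normal((d_model, n_heads * head_dim)) / np.sqrt(d_model)


def compress(h):
    """Down-project hidden states (T, d_model) to latents (T, latent_dim)."""
    return h @ W_dkv


def expand(latent):
    """Up-project cached latents to per-head keys and values."""
    T = latent.shape[0]
    k = (latent @ W_uk).reshape(T, n_heads, head_dim)
    v = (latent @ W_uv).reshape(T, n_heads, head_dim)
    return k, v


def attend(h_new, latent_cache):
    """One decode step: the newest token attends over all cached latents."""
    q = (h_new @ W_q).reshape(n_heads, head_dim)
    k, v = expand(latent_cache)                    # rebuilt on the fly
    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(head_dim)
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("ht,thd->hd", weights, v).reshape(-1)


hidden = rng.standard_normal((10, d_model))        # 10 tokens of context
latent_cache = compress(hidden)                    # the only state stored
out = attend(hidden[-1], latent_cache)

full_kv_elems = 2 * n_heads * head_dim             # what plain MHA would cache
print(latent_cache.shape, out.shape, full_kv_elems // latent_dim)
```

In this toy setup the cache holds 8 values per token instead of the 128 that uncompressed per-head keys and values would need, at the cost of the extra up-projection during attention.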

Why it is notable

Unusually, the DeepSeek papers report that MLA not only saves memory but also matches or slightly exceeds the quality of full multi-head attention, rather than trading quality for compression as the simpler head-sharing schemes do. That combination is why MLA drew intense interest among teams running large open-weight models on constrained hardware: a smaller cache per token directly extends the context length and concurrency a fixed GPU budget can support.
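
The link between per-token cache size and capacity is simple arithmetic, sketched below. The layer count (61), latent width (512), positional key width (64) and 2-byte elements are assumptions based on DeepSeek-V3's published configuration, and they reproduce the roughly 70 KB per token figure quoted earlier. The 40 GiB budget and the 300 KiB grouped-query comparison are purely hypothetical placeholders.

```python
def kv_bytes_per_token(latent_dim, rope_dim, n_layers, bytes_per_elem):
    """Bytes of MLA cache one token occupies across all layers."""
    return (latent_dim + rope_dim) * n_layers * bytes_per_elem


def tokens_in_budget(budget_bytes, bytes_per_token):
    """Whole tokens of cache that fit in a fixed memory budget."""
    return budget_bytes // bytes_per_token


# Assumed DeepSeek-V3-like configuration, 16-bit cache entries.
mla_bytes = kv_bytes_per_token(latent_dim=512, rope_dim=64,
                               n_layers=61, bytes_per_elem=2)

# Hypothetical comparison point: a grouped-query model at 300 KiB per token.
gqa_bytes = 300 * 1024

# Hypothetical memory left for the cache after loading weights.
budget = 40 * 1024**3

mla_tokens = tokens_in_budget(budget, mla_bytes)
gqa_tokens = tokens_in_budget(budget, gqa_bytes)

print(f"MLA cache per token: {mla_bytes} bytes")
print(f"tokens in budget: MLA {mla_tokens}, GQA {gqa_tokens}")
```

Those tokens can be spent on one long context or split across many concurrent requests, which is the trade-off operators tune on a fixed GPU budget.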

For sovereign operators, MLA is one reason certain frontier-class open models remain practical to self-host. It pursues the same goal as Grouped-Query Attention, shrinking the KV cache that dominates long-context memory use, but by a different philosophy: compressing keys and values into a shared latent instead of sharing them across heads.
