
Multi-Query Attention (MQA)


Definition

Multi-Query Attention (MQA) is a modification of the standard multi-head attention used in transformer language models. In ordinary multi-head attention, every query head has its own matching key head and value head. MQA keeps the multiple query heads but collapses all of the key and value projections down to a single shared key head and a single shared value head. Proposed by Noam Shazeer in 2019, it was designed specifically to make autoregressive inference faster.
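As a rough illustration of the definition above, the following sketch computes one decoding step of MQA in plain Python. The function name `mqa_decode_step` and the toy vectors are hypothetical, chosen only for this example; real implementations use batched tensor operations. The point to notice is that every query head reads the same `keys` and `values` lists.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mqa_decode_step(queries, keys, values):
    """One decoding step of multi-query attention (illustrative only).

    queries: n_heads query vectors for the new token, one per head
    keys:    seq_len key vectors, a single head shared by all query heads
    values:  seq_len value vectors, a single head shared by all query heads
    returns: n_heads output vectors
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    d_value = len(values[0])
    outputs = []
    for q in queries:
        # Every head scores against the SAME keys; only the query differs.
        weights = softmax([dot(q, k) * scale for k in keys])
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(d_value)]
        outputs.append(out)
    return outputs


# Toy example: 2 query heads, 2 cached tokens, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
outputs = mqa_decode_step(queries, keys, values)
```

In full multi-head attention, `keys` and `values` would each carry one list per head; here a single list serves all heads, which is the entire change MQA makes.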

The memory bottleneck it addresses

During token-by-token generation, the bottleneck is usually memory bandwidth rather than arithmetic: for every new token, the cached keys and values of all previous tokens must be read back from GPU memory. Because MQA caches one key head and one value head per layer instead of one per query head, the KV cache shrinks by a factor equal to the number of query heads, for example 32x on a model with 32 heads. A smaller cache means less memory traffic per generated token and therefore lower latency, plus room to serve longer contexts or more simultaneous requests on the same card.
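The size of the stored state can be estimated with simple arithmetic. The sketch below assumes a hypothetical model shape (32 layers, 32 query heads, head dimension 128, 16-bit cache entries, a 4096-token context); these numbers are illustrative and do not describe any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value


# Hypothetical model shape, for illustration only.
n_layers, n_heads, d_head, seq_len = 32, 32, 128, 4096

mha_bytes = kv_cache_bytes(n_layers, n_heads, d_head, seq_len)  # one KV head per query head
mqa_bytes = kv_cache_bytes(n_layers, 1, d_head, seq_len)        # one shared KV head

print(f"MHA cache: {mha_bytes / 1024**3:.2f} GiB")
print(f"MQA cache: {mqa_bytes / 1024**2:.2f} MiB")
print(f"Reduction: {mha_bytes // mqa_bytes}x")
```

Under these assumptions the cache drops from 2 GiB to 64 MiB per sequence, a reduction equal to the number of heads.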

The trade-off

Forcing every query head to attend through the same key/value projection reduces representational capacity, and models trained with pure MQA can show measurable quality loss and training instability compared with full multi-head attention. This limitation is why the later Grouped-Query Attention (GQA) design exists: it lets groups of query heads share a key/value head, offering a tunable middle ground rather than the all-or-nothing collapse MQA imposes.
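One way to picture the spectrum is as a mapping from query heads to key/value heads. The sketch below is a simplified illustration, not any library's API: the function name `head_mapping` is hypothetical, and it assumes contiguous query heads are grouped together, which is a common convention but not the only possible one.

```python
def head_mapping(n_query_heads, n_kv_heads):
    """Return, for each query head, the index of the KV head it reads.

    Assumes contiguous grouping: the first group of query heads shares
    KV head 0, the next group shares KV head 1, and so on.
    """
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return [h // group_size for h in range(n_query_heads)]


mha_map = head_mapping(8, 8)  # full multi-head: each query head has its own KV head
gqa_map = head_mapping(8, 2)  # grouped-query: four query heads per KV head
mqa_map = head_mapping(8, 1)  # multi-query: every query head shares KV head 0

print(mha_map)
print(gqa_map)
print(mqa_map)
```

Setting the number of key/value heads to 1 recovers MQA, and setting it equal to the number of query heads recovers full multi-head attention; anything in between is a GQA configuration.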

For an operator running open-weight models on local hardware, recognizing whether a model uses MQA explains its memory footprint and helps predict how many concurrent sessions a GPU can hold. MQA's savings stack with the gains from a well-managed KV cache, and it sits on a spectrum alongside Grouped-Query Attention and full multi-head attention.
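For a back-of-the-envelope capacity estimate, an operator can divide the memory left after loading weights by the per-session cache size. The sketch below uses hypothetical numbers (a 24 GiB card, 14 GiB of weights, 32 layers, head dimension 128, 4096-token sessions) and ignores activations, fragmentation, and runtime overhead, so it gives an upper bound rather than a prediction.

```python
GIB = 1024**3


def max_concurrent_sessions(gpu_bytes, weight_bytes, n_layers, n_kv_heads,
                            d_head, context_len, bytes_per_value=2):
    """Upper bound on sessions whose KV caches fit next to the weights.

    Ignores activations, allocator fragmentation, and framework overhead.
    """
    per_session = 2 * n_layers * n_kv_heads * d_head * context_len * bytes_per_value
    free = gpu_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // per_session


# Hypothetical deployment, for illustration only.
common = dict(gpu_bytes=24 * GIB, weight_bytes=14 * GIB,
              n_layers=32, d_head=128, context_len=4096)

mha_sessions = max_concurrent_sessions(n_kv_heads=32, **common)
mqa_sessions = max_concurrent_sessions(n_kv_heads=1, **common)

print(f"Full multi-head attention: up to {mha_sessions} sessions")
print(f"Multi-query attention:     up to {mqa_sessions} sessions")
```

The number of key/value heads is normally listed in a model's published configuration, so the same estimate can be repeated for any open-weight model once that value is known.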
