Grouped-Query Attention (GQA)

Grouped-query attention (GQA) is a memory-saving form of self-attention introduced in a 2023 Google paper by Ainslie and colleagues. It sits between two extremes: standard multi-head attention, where every query head has its own key and value heads, and multi-query attention (MQA), where all query heads share a single key/value head. GQA splits the query heads into groups, and each group shares one key/value head — a tunable dial between the quality of full multi-head attention and the efficiency of MQA.
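The grouping can be made concrete with a small sketch. It assumes that consecutive query heads form a group, which is an illustrative layout rather than a statement about any particular implementation; the function name `kv_head_for_query_head` is hypothetical.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head reads from.

    Assumption: consecutive query heads share a KV head. Real implementations
    may order heads differently.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query_head out of range")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# The three regimes, shown for 8 query heads:
mha = [kv_head_for_query_head(q, 8, 8) for q in range(8)]  # one KV head per query head
gqa = [kv_head_for_query_head(q, 8, 2) for q in range(8)]  # two groups of four
mqa = [kv_head_for_query_head(q, 8, 1) for q in range(8)]  # every head shares KV head 0
print(mha, gqa, mqa)
```

Setting the KV-head count equal to the query-head count gives multi-head attention, setting it to one gives MQA, and anything in between is GQA.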

Why fewer key/value heads matter

During generation, a Transformer caches the key and value vectors of every past token so that it never recomputes them; this is the KV cache. The cache grows linearly with sequence length, and at long context it, rather than the model weights, often dominates memory. Its size scales with the number of key/value heads, so cutting them cuts the cache by the same ratio: a model with 32 query heads grouped over 8 KV heads carries a KV cache one quarter the size of its full multi-head equivalent. Token generation is largely bottlenecked by memory bandwidth, because the cache and weights are streamed through the GPU for every token, so a smaller cache translates directly into faster generation and more headroom for longer prompts or bigger batches. The GQA authors also showed that existing multi-head checkpoints can be converted to GQA and "uptrained" with a small fraction of the original pre-training compute (about 5% in the paper's experiments), which helped the technique spread quickly through the open-model ecosystem.

The quality trade-off

Full multi-head attention gives every query head its own learned view of the past; MQA forces all heads to share one, which measurably hurts quality on some tasks. GQA's grouping recovers most of the lost quality while keeping most of the memory savings: the paper's central result was that an intermediate number of KV groups reaches quality close to multi-head attention at speed comparable to MQA. That favorable trade is why GQA became a default rather than a niche optimization.

Practical impact for sovereign operators

Most modern open-weight model families ship with GQA, in large part because it makes long-context inference affordable, including on consumer GPUs. When you evaluate whether a model fits your hardware, the KV-head count deserves as much attention as parameter count: two models of identical size can differ several-fold in KV-cache appetite, and it is the cache that decides whether a 32K-token session fits in your VRAM alongside the weights. A rough sizing habit for local inference: quantized weights plus KV cache at your intended context window must fit in memory together, and GQA is the design choice that keeps the second term manageable. Many inference engines can also quantize the KV cache itself for further savings. The two tricks stack, since GQA reduces how many vectors exist and cache quantization shrinks each one.
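The sizing habit can be turned around to ask how much context a given card leaves room for. The sketch below is a rough budget under stated assumptions, not a measurement: the function name `max_context_tokens` is hypothetical, the VRAM, weight, and overhead figures are invented for illustration, and real runtimes add activation buffers and fragmentation that this ignores.

```python
def max_context_tokens(vram_bytes: int, weight_bytes: int,
                       kv_bytes_per_token: int, overhead_bytes: int = 0) -> int:
    """Largest context length whose KV cache fits in memory next to the weights.

    Rough estimate only: treats everything except weights, a fixed overhead,
    and the KV cache as free.
    """
    if kv_bytes_per_token <= 0:
        raise ValueError("kv_bytes_per_token must be positive")
    free_bytes = vram_bytes - weight_bytes - overhead_bytes
    return max(0, free_bytes // kv_bytes_per_token)


GIB = 1024 ** 3
KIB = 1024

# Hypothetical setup: 24 GiB card, 14 GiB of quantized weights, 2 GiB of other overhead.
vram, weights, overhead = 24 * GIB, 14 * GIB, 2 * GIB

# Two hypothetical models of the same size whose per-token cache cost differs 4x,
# as it would between 32 and 8 KV heads.
many_kv_heads = max_context_tokens(vram, weights, 512 * KIB, overhead)
few_kv_heads = max_context_tokens(vram, weights, 128 * KIB, overhead)
print(many_kv_heads, few_kv_heads)  # 16384 65536
```

With these made-up numbers, only the model with fewer KV heads leaves room for a 32K-token session.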

Spotting it on a model card

GQA announces itself in two configuration fields: the number of attention heads and the number of key/value heads. When the second is smaller than the first, the model uses GQA, and the ratio between them is the KV-cache saving: 32 query heads over 8 KV heads means a cache one quarter the multi-head size, and a KV-head count of one means full multi-query attention. This makes back-of-envelope sizing possible before downloading anything. KV-cache memory grows with layers, KV heads, head dimension, context length, and bytes per value, so halving the KV heads or halving your context target each halves the cache. When a long conversation with a local model slows down or aborts with an out-of-memory error at high context, the culprit is usually this cache rather than the weights, because the weights stay fixed while the cache grows with every token.
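A minimal sketch of that back-of-envelope calculation follows. The field names (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`) follow a common `config.json` convention and are an assumption here; other formats may name them differently. The example values are hypothetical, and the head dimension is derived as hidden size divided by attention heads, which does not hold for every architecture.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """Approximate KV-cache size in bytes.

    The leading 2 counts one key vector plus one value vector per token,
    per layer, per KV head. bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value * batch_size


def attention_kind(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention variant from the two head counts."""
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


# Hypothetical config values, not taken from any specific model card.
config = {
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "hidden_size": 4096,
}
# Assumption: head_dim = hidden_size / num_attention_heads.
head_dim = config["hidden_size"] // config["num_attention_heads"]

kind = attention_kind(config["num_attention_heads"], config["num_key_value_heads"])
gqa_bytes = kv_cache_bytes(config["num_hidden_layers"], config["num_key_value_heads"],
                           head_dim, context_len=32768)
mha_bytes = kv_cache_bytes(config["num_hidden_layers"], config["num_attention_heads"],
                           head_dim, context_len=32768)

GIB = 1024 ** 3
print(kind, gqa_bytes / GIB, mha_bytes / GIB)  # GQA 4.0 16.0
```

For this hypothetical configuration, a 32K-token context at 16-bit precision needs about 4 GiB of cache with 8 KV heads, against 16 GiB for the multi-head equivalent.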

See also: positional encoding (RoPE), which governs how those cached keys encode position; the parent attention mechanism; and quantization, the other half of the fit-it-on-your-GPU equation.
