Definition
Grouped-query attention (GQA) is a memory-saving form of self-attention introduced in a 2023 Google paper by Ainslie and colleagues. It sits between two extremes: standard multi-head attention, where every query head has its own key and value heads, and multi-query attention, where all query heads share a single key/value head. GQA splits the query heads into groups, and each group shares one key/value head, giving a tunable trade-off between quality and efficiency.
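The grouping can be illustrated with a small sketch of which key/value head each query head reads from. This is only an illustration under stated assumptions: the function names are hypothetical, and the contiguous assignment (neighbouring query heads share a key/value head) is one common convention rather than something this entry specifies.

```python
def kv_head_for_query_head(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key/value head used by a given query head.

    Assumes contiguous grouping: query heads 0..g-1 share KV head 0,
    heads g..2g-1 share KV head 1, and so on, where g is the group size.
    """
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    if not 0 <= query_head < n_query_heads:
        raise ValueError("query_head out of range")
    group_size = n_query_heads // n_kv_heads  # query heads per KV head
    return query_head // group_size


def head_mapping(n_query_heads: int, n_kv_heads: int) -> list[int]:
    """List the KV head index for every query head."""
    return [
        kv_head_for_query_head(q, n_query_heads, n_kv_heads)
        for q in range(n_query_heads)
    ]


# Multi-head attention: one KV head per query head.
print(head_mapping(8, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
# Grouped-query attention: two groups of four query heads.
print(head_mapping(8, 2))  # [0, 0, 0, 0, 1, 1, 1, 1]
# Multi-query attention: every query head shares KV head 0.
print(head_mapping(8, 1))  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Setting the number of key/value heads equal to the number of query heads recovers multi-head attention, and setting it to one recovers multi-query attention, so both extremes are special cases of the same mapping.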
Why fewer key/value heads matters
During generation, a Transformer caches the key and value vectors of every past token so that it does not have to recompute them; this is the KV cache. At long context lengths this cache, rather than the model weights, can dominate memory use. Because the cache scales with the number of key/value heads, GQA shrinks it by the group ratio (the number of query heads per key/value head), which eases the memory-bandwidth bottleneck that limits generation speed. The authors also showed that existing multi-head checkpoints can be converted to GQA using a small fraction (about 5% in the paper) of the original pre-training compute.
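A back-of-the-envelope formula makes the saving concrete. The symbols and the example configuration below are illustrative assumptions, not figures taken from the paper or from any particular model: L is the number of layers, H_kv the number of key/value heads, d_head the per-head dimension, T the number of cached tokens, and b the bytes per stored element, for a single sequence.

```latex
% Factor 2 accounts for storing both keys and values.
\[
M_{\mathrm{KV}} = 2 \cdot L \cdot H_{\mathrm{kv}} \cdot d_{\mathrm{head}} \cdot T \cdot b
\]

% Assumed example: L = 32, d_head = 128, T = 8192, b = 2 (16-bit values).
\[
H_{\mathrm{kv}} = 32:\quad
2 \cdot 32 \cdot 32 \cdot 128 \cdot 8192 \cdot 2
= 4{,}294{,}967{,}296 \text{ bytes} = 4\ \mathrm{GiB}
\]
\[
H_{\mathrm{kv}} = 8:\quad
2 \cdot 32 \cdot 8 \cdot 128 \cdot 8192 \cdot 2
= 1{,}073{,}741{,}824 \text{ bytes} = 1\ \mathrm{GiB}
\]
```

If this hypothetical model has 32 query heads, 8 key/value heads corresponds to groups of 4, and the cache shrinks by the same factor of 4.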
Practical impact for sovereign operators
Many modern open-weight models, including members of the Llama and Mistral families, ship with GQA, in part because it makes long-context inference feasible on consumer GPUs. When you assess whether a model fits your hardware, check the number of key/value heads alongside the parameter count, since it determines how quickly the KV cache grows with context length.
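A rough fit check might look like the sketch below. Everything in it is an assumption made for illustration: the function name, the hypothetical 7-billion-parameter configuration, and the 24 GiB budget. It counts only weights plus KV cache for a single sequence and ignores activations, framework overhead, and fragmentation, so treat a narrow pass as a likely fail in practice.

```python
GIB = 1024 ** 3


def fits_in_memory(
    n_params: int,
    bytes_per_param: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_kv_element: int,
    memory_budget_bytes: int,
) -> bool:
    """Estimate whether weights plus KV cache fit in a memory budget.

    Simplified on purpose: single sequence, no activations, no runtime overhead.
    """
    weight_bytes = n_params * bytes_per_param
    # Factor 2 covers keys and values, stored per layer, per KV head, per token.
    kv_cache_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_kv_element
    )
    return weight_bytes + kv_cache_bytes <= memory_budget_bytes


# Hypothetical model: 7e9 parameters in 16-bit, 32 layers, head_dim 128,
# evaluated at a 32,768-token context on a 24 GiB GPU.
common = dict(
    n_params=7_000_000_000,
    bytes_per_param=2,
    n_layers=32,
    head_dim=128,
    context_len=32_768,
    bytes_per_kv_element=2,
    memory_budget_bytes=24 * GIB,
)

print(fits_in_memory(n_kv_heads=32, **common))  # 32 KV heads: 16 GiB of cache
print(fits_in_memory(n_kv_heads=8, **common))   # 8 KV heads: 4 GiB of cache
```

Under these assumptions the parameter count is identical in both calls; only the number of key/value heads changes the outcome.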
See also positional encoding (RoPE) and Transformer.
