Definition
Grouped-Query Attention (GQA) is an attention mechanism that interpolates between two extremes: full multi-head attention, where every query head has its own key head and value head, and Multi-Query Attention, where all query heads share a single key head and value head. GQA partitions the query heads into a small number of groups, and all heads within a group share one key head and one value head. Introduced by Google researchers in 2023, it has become the default in many modern open-weight models, including Llama 2 70B and the Llama 3 models.
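To make the sharing concrete, here is a minimal sketch of the attention step in plain Python. It assumes the queries, keys, and values have already been projected, uses nested lists instead of tensors, and omits causal masking and the output projection; the function names and the contiguous grouping of heads are illustrative choices, not taken from any particular library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(q, k, v):
    """Unmasked attention with shared key/value heads.

    q: [n_query_heads][seq_len][head_dim]
    k, v: [n_kv_heads][seq_len][head_dim]
    Returns a list shaped like q.
    """
    n_query_heads, n_kv_heads = len(q), len(k)
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key/value heads")
    group_size = n_query_heads // n_kv_heads
    head_dim = len(q[0][0])
    scale = 1.0 / math.sqrt(head_dim)

    output = []
    for h in range(n_query_heads):
        g = h // group_size  # index of the key/value head shared by this group
        head_output = []
        for q_vec in q[h]:
            scores = [
                scale * sum(a * b for a, b in zip(q_vec, k_vec))
                for k_vec in k[g]
            ]
            weights = softmax(scores)
            # Weighted sum of the shared value vectors.
            head_output.append([
                sum(w * v_vec[i] for w, v_vec in zip(weights, v[g]))
                for i in range(len(v[g][0]))
            ])
        output.append(head_output)
    return output
```

Only `k` and `v` shrink relative to multi-head attention; the number of query heads, and therefore the shape of the output, is unchanged.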
Why it is a sweet spot
Both endpoints are special cases of GQA. If the number of groups equals the number of query heads, each query head has its own key/value head and GQA is identical to standard multi-head attention. If there is only one group, every query head shares the same key/value head and GQA is identical to Multi-Query Attention. With an intermediate group count, a model captures most of the memory and speed benefit of sharing while keeping quality close to that of full multi-head attention. A typical configuration might use eight key/value heads for thirty-two query heads, so four query heads share each key/value head and the key/value cache is a quarter of its multi-head size.
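The special cases can be seen directly in the mapping from query heads to key/value heads. The sketch below assumes contiguous grouping (consecutive query heads share a key/value head), which is a common convention but an assumption here; the helper names are hypothetical.

```python
def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Return the index of the key/value head used by a given query head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key/value heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_assignment(n_query_heads, n_kv_heads):
    """List the key/value head index for every query head."""
    return [
        kv_head_for_query_head(h, n_query_heads, n_kv_heads)
        for h in range(n_query_heads)
    ]


print(kv_assignment(8, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]  multi-head attention
print(kv_assignment(8, 1))  # [0, 0, 0, 0, 0, 0, 0, 0]  Multi-Query Attention
print(kv_assignment(8, 2))  # [0, 0, 0, 0, 1, 1, 1, 1]  GQA with two groups
```

With thirty-two query heads and eight key/value heads, the same function assigns four query heads to each key/value head.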
Practical impact
Because only one key/value head per group is cached, the key/value state shrinks by the ratio of query heads to key/value heads. That means less data to read from GPU memory during each decode step, which directly lowers per-token latency and lets more requests fit in the same memory budget. Crucially, an existing multi-head model can be "uptrained" into a GQA model with only a small fraction of the original training compute (about 5% in the original paper), so the technique was cheap to retrofit onto already-trained checkpoints.
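The GQA paper initializes each shared head for uptraining by mean-pooling the original key (or value) heads in its group. The sketch below illustrates only that initialization step on toy weight matrices stored as nested lists; the function name and data layout are assumptions for illustration, and the continued training that follows is not shown.

```python
def mean_pool_kv_heads(head_weights, n_groups):
    """Average per-head projection matrices into n_groups shared heads.

    head_weights: one matrix (list of rows) per original key or value head.
    Returns n_groups matrices, each the element-wise mean of its group.
    """
    n_heads = len(head_weights)
    if n_heads % n_groups != 0:
        raise ValueError("heads must divide evenly into groups")
    group_size = n_heads // n_groups

    pooled = []
    for g in range(n_groups):
        members = head_weights[g * group_size:(g + 1) * group_size]
        n_rows, n_cols = len(members[0]), len(members[0][0])
        pooled.append([
            [sum(m[r][c] for m in members) / group_size for c in range(n_cols)]
            for r in range(n_rows)
        ])
    return pooled


# Four toy key heads, each a 1x2 matrix, pooled into two shared heads.
key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
print(mean_pool_kv_heads(key_heads, 2))  # [[[2.0, 3.0]], [[6.0, 7.0]]]
```

Pooling into a single group yields the Multi-Query starting point, and pooling into as many groups as heads leaves the weights unchanged.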
For self-hosters, GQA is often a key reason a large model can still serve long contexts on consumer or prosumer GPUs: a smaller KV cache leaves more memory for the weights and for longer sequences. In efficiency terms, it sits between full multi-head attention and Multi-Query Attention.
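For a rough sense of scale, the KV cache size can be estimated from a handful of model dimensions. The configuration below (32 layers, head dimension 128, an 8192-token context, 16-bit values, batch size 1) is a hypothetical example rather than any specific model, and the estimate ignores allocator overhead and cache quantization.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for one model instance."""
    # Factor of 2: each layer caches one key tensor and one value tensor.
    return (2 * n_layers * n_kv_heads * head_dim * seq_len
            * bytes_per_value * batch_size)


GIB = 1024 ** 3
config = dict(n_layers=32, head_dim=128, seq_len=8192)  # hypothetical model

for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(n_kv_heads=n_kv_heads, **config)
    print(f"{name}: {size / GIB:.3f} GiB")
# MHA: 4.000 GiB
# GQA: 1.000 GiB
# MQA: 0.125 GiB
```

The cache scales linearly with the number of key/value heads, so moving from 32 to 8 key/value heads cuts it to a quarter at any context length.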
