Grouped-Query Attention (GQA)

Definition

Grouped-Query Attention (GQA) is an attention mechanism that interpolates between two extremes: full multi-head attention, where every query head has its own key head and value head, and Multi-Query Attention, where all query heads share a single key head and value head. GQA partitions the query heads into a small number of groups, and the heads within each group share one key head and one value head. Introduced by Google researchers in 2023, it has become the default in many modern open-weight models, including recent members of the Llama family.
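The sharing scheme can be made concrete with a minimal sketch of a single decode step, written in plain Python for readability rather than speed. The function name `grouped_query_attention_step`, the list-of-lists data layout, and the choice that consecutive query heads form a group are illustrative assumptions, not taken from any particular library.

```python
import math


def grouped_query_attention_step(queries, keys, values):
    """One decode step of grouped-query attention for a single sequence.

    queries: list of num_query_heads vectors, each of length head_dim.
    keys, values: list of num_kv_heads caches; each cache is a list of
        seq_len vectors (one per cached token).
    Returns one output vector per query head.
    """
    num_query_heads, num_kv_heads = len(queries), len(keys)
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key/value heads")
    group_size = num_query_heads // num_kv_heads

    outputs = []
    for head, q in enumerate(queries):
        # Assumption: consecutive query heads share the same key/value head.
        kv = head // group_size
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys[kv]]
        # Numerically stable softmax over the cached positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out_dim = len(values[kv][0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values[kv])) / total
             for d in range(out_dim)]
        )
    return outputs
```

Only `num_kv_heads` key and value caches are stored, yet every query head still produces its own output, because each head applies its own query vector to the cache it shares with its group.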

Why it is a sweet spot

The two endpoints are special cases of GQA. If the number of groups equals the number of query heads, each query head has its own key/value head and GQA is identical to standard multi-head attention. If there is only one group, every query head shares the same key/value head and GQA is identical to Multi-Query Attention. By choosing an intermediate group count, a model captures most of the memory and speed benefit of sharing while retaining nearly all the quality of full attention. A typical configuration might use eight key/value groups for thirty-two query heads, so that four query heads share each key/value head and the key/value state is a quarter the size of the multi-head equivalent.
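The endpoints and the thirty-two-to-eight example can be checked with a small sketch that lists which key/value head each query head reads from. The helper `kv_head_assignment` is hypothetical, and it assumes contiguous grouping (consecutive query heads share a key/value head); other layouts are possible.

```python
def kv_head_assignment(num_query_heads: int, num_kv_heads: int) -> list[int]:
    """Return, for each query head, the index of the key/value head it uses."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key/value heads")
    group_size = num_query_heads // num_kv_heads
    # Assumption: contiguous grouping of query heads.
    return [head // group_size for head in range(num_query_heads)]


# As many groups as query heads: every head is private (multi-head attention).
print(kv_head_assignment(8, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
# A single group: every head shares one key/value head (Multi-Query Attention).
print(kv_head_assignment(8, 1))   # [0, 0, 0, 0, 0, 0, 0, 0]
# Intermediate case from the text: 32 query heads, 8 key/value groups.
print(kv_head_assignment(32, 8)[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]
```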

Practical impact

The smaller key/value state means less data to read from GPU memory during each decode step, which directly lowers per-token latency and lets more requests fit in the same memory budget. Crucially, an existing multi-head model can be "uptrained" into a GQA model with only a small fraction of the original training compute, so the technique is cheap to retrofit onto already-trained checkpoints.
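The memory effect is simple arithmetic, sketched below. The model dimensions (32 layers, a head dimension of 128, 16-bit cache entries, an 8,192-token context) are hypothetical values chosen for illustration, and `kv_cache_bytes` is an assumed helper name, not a library function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Size of the key/value cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


GIB = 2 ** 30
# Hypothetical model: 32 layers, head_dim 128, fp16 cache, 8,192-token context.
for label, kv_heads in [("multi-head (32 KV heads)", 32),
                        ("GQA (8 KV heads)", 8),
                        ("multi-query (1 KV head)", 1)]:
    size = kv_cache_bytes(num_layers=32, num_kv_heads=kv_heads,
                          head_dim=128, seq_len=8192)
    print(f"{label}: {size / GIB} GiB")
# multi-head (32 KV heads): 4.0 GiB
# GQA (8 KV heads): 1.0 GiB
# multi-query (1 KV head): 0.125 GiB
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, and the saving applies per sequence, so it multiplies with batch size.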

For self-hosters, GQA is often a large part of the reason a big model still fits on consumer or prosumer GPUs while serving long contexts. Because the KV cache grows with context length, shrinking it by the grouping factor frees memory for longer contexts or more concurrent requests.
