Definition
Hybrid attention describes architectures that deliberately mix two kinds of sequence-mixing layers: a small number of expensive full-attention layers and a larger number of cheaper, efficient layers such as sliding-window attention, linear attention, or state-space layers. The goal is to keep most of the long-range recall that full attention provides while paying the quadratic cost only occasionally, so the model stays affordable on long inputs.
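The cost argument can be made concrete by counting how many query-key pairs each kind of layer scores. The following is a minimal sketch under simplifying assumptions (causal attention, a single head, and a window that includes the current token); the function names are illustrative and do not come from any library.

```python
def full_pairs(n: int) -> int:
    """Query-key pairs scored by one causal full-attention layer over n tokens."""
    # Token i (0-based) attends to i + 1 positions, so the total is 1 + 2 + ... + n.
    return n * (n + 1) // 2


def window_pairs(n: int, w: int) -> int:
    """Query-key pairs scored by one causal sliding-window layer of width w."""
    if n <= w:
        # The window never truncates anything, so this is just full attention.
        return full_pairs(n)
    # The first w tokens see 1..w keys; every later token sees exactly w keys.
    return w * (w + 1) // 2 + (n - w) * w


def hybrid_pairs(n: int, w: int, n_full: int, n_local: int) -> int:
    """Total pairs for a stack with n_full full layers and n_local windowed layers."""
    return n_full * full_pairs(n) + n_local * window_pairs(n, w)
```

When n is much larger than w, `window_pairs` grows roughly like n times w while `full_pairs` grows roughly like n squared over two, so replacing most full layers with windowed ones removes most of the quadratic term from the total.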
Local plus global
A common pattern interleaves local and global attention. Local (sliding-window) layers let each token attend only to a fixed window of nearby tokens, which is cheap but myopic; the occasional global layer lets every token see the entire prefix, so information can still travel across the whole sequence. Gemma models illustrate the idea: Gemma 2 alternated local and global layers one-to-one, while Gemma 3 moved to five local layers for every global layer and used a smaller window, trading more of the full-attention budget for efficiency. Research suggests that the full-attention layers carry most of the genuine long-range retrieval, while the efficient layers mainly shape how the model learns.
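The interleaving described above can be sketched as a layer schedule plus an attention mask per layer type. This is an illustrative sketch, not the code of any particular model: the function names are hypothetical, and placing the global layer at the end of each group of local layers is an assumption made only for the example.

```python
from typing import List, Optional


def layer_schedule(n_layers: int, local_per_global: int) -> List[str]:
    """Label each layer, with one global layer after every run of local layers."""
    # Assumption: the global layer closes each group; real models may order it differently.
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]


def causal_mask(n: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    # window=None means a global layer: every earlier position is visible.
    # Otherwise only the most recent `window` positions (including q) are visible.
    return [
        [k <= q and (window is None or q - k < window) for k in range(n)]
        for q in range(n)
    ]
```

Calling `layer_schedule(6, 5)` gives five local layers followed by one global layer, and `layer_schedule(4, 1)` gives the one-to-one alternation. A local layer would then use `causal_mask(n, window=w)` and a global layer `causal_mask(n)`.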
State-space hybrids
The same logic applies to architectures that mix attention with non-attention mixers. Models that interleave state-space layers with a minority of attention layers, such as Jamba, are hybrid in this broader sense: the recurrent layers handle long context cheaply and the attention layers supply precise recall.
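One way to see why the recurrent layers are cheap on long inputs is to compare what each layer type must keep in memory while processing a sequence. The sketch below uses abstract units and hypothetical names: it assumes an attention layer stores one cache entry per token and a recurrent (state-space) layer stores a fixed-size state, and the attention ratio is an illustrative parameter rather than a claim about any specific model.

```python
from typing import List


def hybrid_stack(n_layers: int, attention_every: int) -> List[str]:
    """Label layers so that every `attention_every`-th layer is attention."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "recurrent"
        for i in range(n_layers)
    ]


def cache_size(stack: List[str], n_tokens: int, state_size: int) -> int:
    """Abstract per-sequence memory, in cache entries, for a stack of layers."""
    # Attention keeps one entry per token, so its memory grows with the sequence.
    # A recurrent layer keeps a fixed-size state no matter how long the sequence is.
    return sum(n_tokens if kind == "attention" else state_size for kind in stack)
```

With these assumptions, only the attention layers contribute memory that grows with sequence length, which is why keeping them in the minority keeps long-context inference affordable while still leaving a few layers that can look up any earlier token exactly.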
Hybrid attention has become a widely used design pattern for efficient long-context models, because reducing the share of full-attention layers lowers the compute and memory needed for long inputs and so lets capable systems run on constrained, self-hosted hardware. For the building blocks it combines, see linear attention, sub-quadratic attention, and Jamba.