Definition
Sub-quadratic attention is an umbrella term for sequence-mixing methods whose time and memory cost grow more slowly than the square of the sequence length. Standard self-attention compares every token to every other token, so doubling the input roughly quadruples the cost; this quadratic scaling is what makes very long contexts prohibitively expensive on conventional transformers. Sub-quadratic methods aim for linear, log-linear, or otherwise gentler growth, which is the key to running long-context models on hardware you can actually own.
The main families
Several approaches reach sub-quadratic cost by different routes. Kernel-based linear attention rewrites the similarity computation so cost scales linearly. State-space models carry a fixed-size recurrent state instead of comparing all token pairs. Sparse and hierarchical methods compute attention only over a selected subset of tokens, achieving roughly log-linear cost; segment-compression methods attend to a compressed summary of history rather than every past token. Each trades some of softmax attention's exactness for scalability.
An honest caveat
Sub-quadratic does not yet mean free. Many methods that are sub-quadratic in theory are implemented with constant-factor speedups in practice, and some underperform full attention on benchmarks that demand sharp recall. The field continues to refine designs that keep quality while delivering true sub-quadratic scaling, which is why production models often combine these methods with a few full-attention layers.
Sub-quadratic attention is the broad category that contains most efficiency innovations in modern sequence modeling. See linear attention, selective state space, and hybrid attention for specific instances.
In Simple Terms
Sub-quadratic attention is an umbrella term for sequence-mixing methods whose time and memory cost grow more slowly than the square of the sequence length. Standard…
