Sub-Quadratic Attention

Sub-quadratic attention is an umbrella term for sequence-mixing methods whose time and memory cost grow more slowly than the square of the sequence length. Standard self-attention compares every token to every other token, so doubling the input roughly quadruples the cost; this quadratic scaling is what makes very long contexts prohibitively expensive on conventional transformers. Sub-quadratic methods aim for linear, log-linear, or otherwise gentler growth, which is the key to running long-context models on hardware you can actually own.
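The scaling claim is easy to check with a few lines of arithmetic. The sketch below is only an illustration: the function names and the three growth labels are made up for this example, and the counts are idealized token interactions, not measured runtimes.

```python
import math


def interaction_count(n_tokens: int, growth: str) -> float:
    """Idealized number of token interactions for a sequence of n_tokens."""
    if growth == "quadratic":
        return float(n_tokens * n_tokens)       # every token against every token
    if growth == "log-linear":
        return n_tokens * math.log2(n_tokens)
    if growth == "linear":
        return float(n_tokens)
    raise ValueError(f"unknown growth class: {growth}")


def doubling_factor(n_tokens: int, growth: str) -> float:
    """How much the idealized cost grows when the input length doubles."""
    return interaction_count(2 * n_tokens, growth) / interaction_count(n_tokens, growth)


for growth in ("quadratic", "log-linear", "linear"):
    factor = doubling_factor(4096, growth)
    print(f"{growth:>10}: x{factor:.2f} when 4096 tokens become 8192")
# quadratic: x4.00, log-linear: x2.17, linear: x2.00
```

Real kernels add constant factors and memory-bandwidth effects on top of these counts, so the ratios describe the trend rather than wall-clock time.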

Why quadratic cost bites twice

The pain is not only compute. A standard attention mechanism also keeps a key-value cache entry for every past token in every layer, so serving a long context consumes VRAM in proportion to context length even before any arithmetic happens. On a data-center cluster that is a billing line; on a single consumer GPU it is a hard wall. This is why the context window you can actually use locally is usually set by memory, not by the model's advertised maximum — and why sub-quadratic designs matter disproportionately to self-hosters.
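A rough estimate of the key-value cache shows why memory sets the usable window. The sketch below assumes one key vector and one value vector per token, per layer, per key-value head, stored in 16-bit precision. The model shape used in the example (32 layers, 8 key-value heads, head dimension 128) is hypothetical and not taken from any particular model card.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # assumption: fp16 or bf16 cache entries
) -> int:
    """Approximate KV cache size for a single sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return per_token * context_len


GIB = 2**30

# Hypothetical model shape, chosen only to make the arithmetic concrete.
for context_len in (32_768, 131_072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=context_len)
    print(f"{context_len:>7} tokens -> {size / GIB:.0f} GiB of KV cache")
# 32768 tokens -> 4 GiB, 131072 tokens -> 16 GiB
```

Under these assumptions the cache alone costs 128 KiB per token, before the model weights and activations are counted, which is how a long context can exhaust a consumer GPU regardless of the advertised maximum.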

The main families

Several approaches reach sub-quadratic cost by different routes. Kernel-based linear attention rewrites the similarity computation so it can be accumulated left-to-right in linear time. State-space models such as Mamba carry a fixed-size recurrent state instead of comparing all token pairs — see state-space model (Mamba). Sparse attention computes attention only over a selected subset of token pairs, and sliding-window attention is the simplest sparse pattern of all: each token attends only to its recent neighbours, giving linear cost at the price of no direct long-range links. Segment-compression methods attend to a compressed summary of history rather than every past token. Each family trades some of softmax attention's exactness for scalability.
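As an illustration of the first family, here is a minimal pure-Python sketch of causal kernel-based linear attention. It assumes the feature map elu(x) + 1, which is one common choice and not the only one, and it replaces softmax similarity with a kernel similarity, so its outputs are not the same as softmax attention. The function names are made up for this example. The pairwise reference is included only to show that the left-to-right accumulation gives the same result as comparing all pairs under the same kernel.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_causal(queries, keys, values):
    """One left-to-right pass with a fixed-size running state."""
    d, dv = len(keys[0]), len(values[0])
    state = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) * v^T
    norm = [0.0] * d                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(d):
            norm[i] += fk[i]
            for j in range(dv):
                state[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = sum(fq[i] * norm[i] for i in range(d))
        outputs.append(
            [sum(fq[i] * state[i][j] for i in range(d)) / denom for j in range(dv)]
        )
    return outputs


def pairwise_reference(queries, keys, values):
    """Same kernel attention computed the quadratic way, for comparison."""
    dv = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys[: t + 1]]
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values[: t + 1])) / total
             for j in range(dv)]
        )
    return outputs
```

The running state has a size that depends only on the head dimensions, not on how many tokens have been seen, which is where both the linear time and the constant per-step memory come from.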

What does not count

A common confusion is worth clearing up: FlashAttention is not sub-quadratic in compute. It computes exact full attention but reorganizes the work to respect the GPU memory hierarchy, delivering large constant-factor speedups and avoiding the full score matrix in memory, while the number of arithmetic operations stays quadratic. It is the standard example of how far implementation quality can go, and of why "sub-quadratic" claims should be read carefully: many methods that are sub-quadratic in theory are beaten in practice by a well-engineered exact kernel until sequences get very long.
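The idea of reorganizing exact attention can be sketched with a blockwise, running-softmax computation for a single query. This is a simplified illustration of the general technique, not FlashAttention's actual kernel: the function names and the block size are assumptions, and the usual 1/sqrt(d) score scaling is omitted for brevity. Every key is still visited once per query, so the total work across all queries remains quadratic.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(q, keys, values):
    """Plain exact attention for one query: materializes all scores at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, computed one key block at a time with a running softmax."""
    dv = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dv
    for start in range(0, len(keys), block_size):
        scores = [dot(q, k) for k in keys[start:start + block_size]]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale                     # rescale earlier partial results
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dv):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same values up to floating-point rounding. Only the order of the work and the amount of intermediate storage differ, which is the sense in which this kind of optimization is exact rather than sub-quadratic.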

How to evaluate the claims

When a model card advertises efficient long context, three questions cut through the fog. First, is the method sub-quadratic in memory as well as compute, since memory is what a single GPU runs out of first? Second, was the model trained at the context length being advertised, or merely extrapolated to it? Third, how does it score on recall-heavy tasks, such as retrieving an exact needle from deep context, rather than only on perplexity? Methods that pass all three are still rare, which is why this glossary treats the category as a frontier rather than a solved problem: one moving fast enough that any specific ranking of methods goes stale within a release cycle, while the three questions themselves do not.

An honest caveat

Sub-quadratic does not yet mean free. Some methods underperform full attention on benchmarks that demand sharp recall (retrieving one exact fact from deep in the context), which is precisely the ability the all-pairs comparison buys. The field's current answer is hybridization: production models increasingly combine a linear-cost backbone with a few full-attention layers, keeping quality while cutting the memory bill. Architectures like Jamba interleave the two on exactly this logic. For a sovereign builder deciding what to run locally, the practical rule is to benchmark on your own long-context tasks: the efficiency families differ most exactly where marketing benchmarks look most alike. See selective state space and Jamba for the leading instances.
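Benchmarking on your own long-context tasks can start from something as small as a needle-in-a-haystack probe. The sketch below is a hypothetical harness, not a standard benchmark: `build_needle_prompt`, `recall_rate`, and the two stand-in "models" are invented for illustration, and `generate` is assumed to be any callable that maps a prompt string to a completion string, which you would replace with a call into your own local model.

```python
from typing import Callable, Sequence

QUESTION = "\n\nQuestion: What is the secret code? Answer:"


def build_needle_prompt(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Bury `needle` at a relative depth (0.0 = start, 1.0 = end) in repeated filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences) + QUESTION


def recall_rate(generate: Callable[[str], str], answer: str, needle: str,
                filler: str, n_sentences: int, depths: Sequence[float]) -> float:
    """Fraction of needle depths at which the completion contains the answer."""
    hits = 0
    for depth in depths:
        prompt = build_needle_prompt(filler, needle, n_sentences, depth)
        if answer in generate(prompt):
            hits += 1
    return hits / len(depths)


# Stand-ins for real models, used only to show how the harness behaves.
def full_context_model(prompt: str) -> str:
    return "7431" if "7431" in prompt else "unknown"


def recent_only_model(prompt: str) -> str:
    # Mimics a method that only sees the most recent 300 characters.
    return "7431" if "7431" in prompt[-300:] else "unknown"


common = dict(answer="7431", needle="The secret code is 7431.",
              filler="The sky was clear that day.", n_sentences=200,
              depths=[0.0, 0.5, 1.0])
print(recall_rate(full_context_model, **common))  # 1.0
print(recall_rate(recent_only_model, **common))   # one hit out of three depths
```

Sweeping the depth and the haystack length with your own documents and questions gives a recall curve that is specific to the workload you actually intend to run.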
