Linear Attention

Sovereign AI

Linear attention is a family of attention mechanisms that reduce the cost of self-attention from quadratic to linear in the sequence length. Standard softmax attention, the heart of the transformer, compares every token with every other token, so doubling the sequence length quadruples the compute and, when the attention matrix is materialized, the memory. Linear attention sidesteps this by replacing the exponential softmax similarity with a kernel: queries and keys are passed through a feature map, and their plain dot products stand in for the softmax comparison. Because the operation becomes ordinary matrix multiplication, associativity applies: the model can aggregate keys and values into a compact summary once and let every query read from that summary, instead of touching every past token individually. The result is compute and memory that grow linearly with sequence length, not quadratically.
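Written out, the substitution looks roughly like the following. The notation is an assumption introduced here for illustration (q_i, k_j, v_j for the query, key, and value vectors, φ for the feature map, n for the sequence length, d for the key dimension); it is a sketch of the general idea, not any one paper's exact formulation.

```latex
% Softmax attention for query i over n tokens:
o_i = \frac{\sum_{j=1}^{n} \exp\!\left(q_i^\top k_j / \sqrt{d}\right) v_j}
           {\sum_{j=1}^{n} \exp\!\left(q_i^\top k_j / \sqrt{d}\right)}

% Kernelized form: the exponential is replaced by a feature-map dot product:
o_i = \frac{\sum_{j=1}^{n} \left(\phi(q_i)^\top \phi(k_j)\right) v_j}
           {\sum_{j=1}^{n} \phi(q_i)^\top \phi(k_j)}

% Regrouped by associativity: both sums are now independent of i:
o_i^\top = \frac{\phi(q_i)^\top \left(\sum_{j=1}^{n} \phi(k_j)\, v_j^\top\right)}
                {\phi(q_i)^\top \left(\sum_{j=1}^{n} \phi(k_j)\right)}
```

In the last line, the two sums over j are computed once and shared by every query, which is where the linear cost comes from.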

The kernel trick at the core

The reordering is the whole insight. Softmax attention must score every query–key pair before weighting values, and that n-by-n matrix of scores is the quadratic object. Write similarity as a product of feature maps instead, and the multiplication can be regrouped so keys and values are combined first into a fixed-size state, which each query then reads. The defining design choice in any linear-attention method is that feature map: the original Linear Transformer used a simple elementwise map, elu(x) + 1, that keeps similarities positive, while later methods such as Performer use randomized feature constructions that approximate the true softmax kernel more faithfully. The choice matters because it is where each variant trades fidelity to softmax against speed and stability.
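The regrouping can be checked numerically. The sketch below is a minimal, non-causal illustration in plain Python, assuming the elu(x) + 1 feature map; the function names (`feature_map`, `quadratic_attention`, `linear_attention`) and the toy sizes are hypothetical, chosen only for this example.

```python
import math
import random


def feature_map(x):
    """elu(x) + 1 applied elementwise; the result is always positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def quadratic_attention(Q, K, V):
    """Score every query against every key (n x n work), then weight the values."""
    fQ = [feature_map(q) for q in Q]
    fK = [feature_map(k) for k in K]
    dv = len(V[0])
    out = []
    for q in fQ:
        sims = [sum(a * b for a, b in zip(q, k)) for k in fK]  # one row of the n x n matrix
        total = sum(sims)
        out.append([sum(s * v[j] for s, v in zip(sims, V)) / total for j in range(dv)])
    return out


def linear_attention(Q, K, V):
    """Combine keys and values into a d x d_v summary first, then let each query read it."""
    fQ = [feature_map(q) for q in Q]
    fK = [feature_map(k) for k in K]
    d, dv = len(fK[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # sum over tokens of phi(k) v^T
    z = [0.0] * d                       # sum over tokens of phi(k)
    for k, v in zip(fK, V):
        for i in range(d):
            z[i] += k[i]
            for j in range(dv):
                S[i][j] += k[i] * v[j]
    out = []
    for q in fQ:
        denom = sum(q[i] * z[i] for i in range(d))
        out.append([sum(q[i] * S[i][j] for i in range(d)) / denom for j in range(dv)])
    return out


# Toy sizes, chosen arbitrarily for the demonstration.
random.seed(0)
n, d, dv = 6, 4, 3
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(dv)] for _ in range(n)]

a = quadratic_attention(Q, K, V)
b = linear_attention(Q, K, V)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print("outputs agree:", max_diff < 1e-9)
```

Both functions compute the same kernelized attention; only the order of multiplication differs. Neither reproduces softmax attention exactly, since the feature map here is a substitute for the exponential, not an approximation of it.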

The recurrent form: constant-memory inference

Linear attention's most practical consequence for self-hosters is its dual identity. The same computation can be unrolled as a recurrence: maintain a fixed-size state matrix, update it token by token as new keys and values arrive, and answer each query from the state. That makes a linear-attention model behave like an RNN at inference time: the state stays the same size, so memory and per-token work are constant regardless of how long the conversation runs. Contrast the standard transformer, whose key–value cache grows with every generated token and steadily consumes VRAM, effectively taxing long context windows twice: once in compute, once in memory. On owned hardware with a fixed memory budget, a model whose footprint does not grow with context length is a genuinely different value proposition: long documents and long-running agent sessions stop being a hardware-upgrade problem.
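A minimal sketch of that recurrence, again in plain Python and again assuming the elu(x) + 1 feature map. The class `RecurrentLinearAttention` and the reference function `causal_quadratic` are hypothetical names for this illustration; a real model would use batched tensors, multiple heads, and learned projections.

```python
import math
import random


def feature_map(x):
    """elu(x) + 1, elementwise: strictly positive, so the normalizer never reaches zero."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


class RecurrentLinearAttention:
    """Hypothetical helper holding a d x d_v state matrix and a length-d normalizer."""

    def __init__(self, d, dv):
        self.S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T
        self.z = [0.0] * d                       # running sum of phi(k)

    def step(self, q, k, v):
        """Absorb one (key, value) pair, then answer the query from the state."""
        fq, fk = feature_map(q), feature_map(k)
        for i, ki in enumerate(fk):
            self.z[i] += ki
            for j, vj in enumerate(v):
                self.S[i][j] += ki * vj
        denom = sum(qi * zi for qi, zi in zip(fq, self.z))
        return [sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
                for j in range(len(v))]

    def state_size(self):
        """Number of floats held, independent of how many tokens were seen."""
        return len(self.S) * len(self.S[0]) + len(self.z)


def causal_quadratic(Q, K, V):
    """Reference: each position re-reads all keys and values up to itself."""
    out = []
    for t, q in enumerate(Q):
        fq = feature_map(q)
        sims = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K[: t + 1]]
        total = sum(sims)
        out.append([sum(s * v[j] for s, v in zip(sims, V)) / total
                    for j in range(len(V[0]))])
    return out


random.seed(0)
n, d, dv = 50, 4, 3
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(dv)] for _ in range(n)]

layer = RecurrentLinearAttention(d, dv)
size_before = layer.state_size()
recurrent = [layer.step(q, k, v) for q, k, v in zip(Q, K, V)]
reference = causal_quadratic(Q, K, V)

max_diff = max(abs(x - y) for ra, rb in zip(recurrent, reference) for x, y in zip(ra, rb))
print("matches causal reference:", max_diff < 1e-9)
print("state size unchanged:", layer.state_size() == size_before)
```

The recurrent path never stores past keys or values, yet it reproduces the causal result token for token. The reference function, by contrast, needs the full history at every step, which is the role the key–value cache plays in a standard transformer.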

Trade-offs to know

The efficiency is bought by giving up the exact softmax structure, and the cost is real. A fixed-size state must compress the past rather than store it, so tasks demanding sharp, precise recall of specific earlier tokens (exact copying, needle-in-a-haystack retrieval) are where pure linear attention degrades first. Early variants also had training-stability quirks that took years of refinements (normalization schemes, gating, decay terms) to tame. This is why many modern systems are hybrids, with a majority of efficient linear or recurrent layers interleaved with a few full-attention layers that provide precise recall where it counts, while others rely on gated formulations that let the model control what its compressed state retains.
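To make the decay idea concrete, here is a deliberately simplified sketch. It assumes a single fixed scalar decay `gamma` applied to the whole state and one-hot keys so the read-out is easy to follow; gated variants typically compute their gates from the input rather than using a constant. The helper names `decayed_update` and `read` are hypothetical.

```python
def decayed_update(S, k, v, gamma):
    """S <- gamma * S + k v^T, with a fixed scalar decay gamma (a simplifying assumption)."""
    return [[gamma * S[i][j] + k[i] * v[j] for j in range(len(v))]
            for i in range(len(k))]


def read(S, q):
    """Unnormalized read from the state: q^T S."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


d, dv, gamma = 2, 1, 0.9
S = [[0.0] * dv for _ in range(d)]

# Store one fact under key e0, then let 10 unrelated tokens (key e1) arrive.
S = decayed_update(S, [1.0, 0.0], [1.0], gamma)
for _ in range(10):
    S = decayed_update(S, [0.0, 1.0], [0.5], gamma)

recalled = read(S, [1.0, 0.0])[0]
print(recalled)  # the stored value 1.0 has faded by a factor of gamma**10, about 0.35
```

The stored value is still retrievable but attenuated, which is the compression trade-off in miniature: a decay close to 1 preserves old information longer, while a smaller decay frees the state for recent tokens. Letting the model choose this per token and per dimension is the motivation for learned gating.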

Where it leads

Linear attention is the conceptual foundation beneath much of the post-transformer architecture wave: RWKV reformulates it into a purely recurrent language model, gated linear attention adds learned gating to the state update, and the broader class of sub-quadratic architectures, including state-space models, pursues the same goal by adjacent mathematics. For the sovereign-AI builder the through-line is simple: these architectures exist precisely to make capable models run on hardware you own, with context lengths that don't require a datacenter's memory. Watch this space when choosing what to run locally.

