Definition
Linear attention is a family of attention mechanisms that reduce the cost of self-attention from quadratic to linear in the sequence length. Standard softmax attention compares every token with every other token, so its time and memory grow with the square of the sequence length. Linear attention sidesteps this by replacing or approximating the softmax similarity with a kernel: each query and key is passed through a feature map, and the dot product of the mapped vectors stands in for the exponential comparison. Because the similarity is now a plain product of feature maps, the associativity of matrix multiplication lets the model multiply the mapped keys by the values first, producing a summary whose size does not depend on the sequence length, and then reuse that summary for every query. The total cost is therefore linear in the sequence length.
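The reordering described above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: it assumes the elementwise map elu(x) + 1 as the feature map and a non-causal (bidirectional) setting, and the function names are hypothetical, chosen only for illustration. Both functions compute the same output; only the order of the matrix products differs.

```python
import numpy as np

def feature_map(x):
    # Assumed feature map for illustration: elu(x) + 1, which is strictly positive.
    # For x > 0 this is x + 1; otherwise it is exp(x).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def attention_quadratic_order(Q, K, V):
    # Forms the full N x N similarity matrix first: O(N^2) time and memory.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    scores = phi_q @ phi_k.T                      # shape (N, N)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def attention_linear_order(Q, K, V):
    # Aggregates keys and values once, then reuses the summary for every query.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv_summary = phi_k.T @ V                      # shape (d, d_v), independent of N
    k_sum = phi_k.sum(axis=0)                     # shape (d,), used for normalization
    return (phi_q @ kv_summary) / (phi_q @ k_sum)[:, None]

rng = np.random.default_rng(0)
N, d, d_v = 6, 4, 3                               # toy sizes, chosen arbitrarily
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d_v))

out_quadratic = attention_quadratic_order(Q, K, V)
out_linear = attention_linear_order(Q, K, V)
same_result = np.allclose(out_quadratic, out_linear)
```

The second function never materializes an N x N matrix, so its cost grows linearly with N for a fixed feature dimension.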
The kernel trick at the core
The defining choice in any linear-attention method is the feature map applied to queries and keys. The original Linear Transformer uses a simple elementwise map, elu(x) + 1, that keeps similarities positive; later methods such as Performer use random features to approximate softmax more faithfully. In recurrent form, linear attention maintains a fixed-size state matrix that is updated token by token, which is why inference runs in constant memory instead of relying on a key-value cache that grows with every token processed.
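The recurrent form can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the feature map is elu(x) + 1, attention is causal, and the names `state`, `key_sum`, and both function names are hypothetical. The masked version is included only to check that the token-by-token recurrence gives the same result as the parallel causal computation.

```python
import numpy as np

def feature_map(x):
    # Assumed feature map for illustration: elu(x) + 1, strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention_recurrent(Q, K, V):
    # Processes one token at a time with a fixed-size state.
    N, d = Q.shape
    d_v = V.shape[1]
    state = np.zeros((d, d_v))    # running sum of outer(phi(k_i), v_i)
    key_sum = np.zeros(d)         # running sum of phi(k_i), for normalization
    out = np.empty((N, d_v))
    for t in range(N):
        phi_q, phi_k = feature_map(Q[t]), feature_map(K[t])
        state += np.outer(phi_k, V[t])
        key_sum += phi_k
        out[t] = (phi_q @ state) / (phi_q @ key_sum)
    return out

def causal_linear_attention_masked(Q, K, V):
    # Parallel reference: full similarity matrix with a lower-triangular mask.
    scores = np.tril(feature_map(Q) @ feature_map(K).T)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
N, d, d_v = 8, 4, 3               # toy sizes, chosen arbitrarily
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d_v))

out_recurrent = causal_linear_attention_recurrent(Q, K, V)
out_masked = causal_linear_attention_masked(Q, K, V)
```

The memory held between steps is just `state` and `key_sum`, whose sizes depend on the feature and value dimensions but not on how many tokens have been seen.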
Trade-offs to know
Linear attention buys efficiency by giving up the exact softmax structure, which can cost accuracy on tasks needing sharp, precise recall and can affect training stability. This is why many modern systems pair it with full attention or gating rather than using it alone.
Linear attention is the conceptual foundation under several architectures covered elsewhere in this glossary, including RWKV, gated linear attention, and the broader class of sub-quadratic attention methods that let capable models run on hardware you own.
