Gated Linear Attention

Gated linear attention (GLA) is a refinement of linear attention that adds data-dependent gating to the model's recurrent state. Plain linear attention accumulates information into a fixed-size state matrix but has no principled way to forget; over long sequences the state blurs or saturates as everything ever seen piles into the same buffer. GLA introduces gates whose values depend on the current input, letting the model decide, token by token, how much of the existing state to keep and how much new information to write. This gives it a controllable, decaying memory while preserving the linear-time, constant-space inference that makes linear attention attractive in the first place.
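The keep-versus-write decision described above can be sketched in plain Python. This is a minimal illustration, assuming one common way of writing the recurrence: the state is scaled row by row by the gate, the outer product of key and value is added, and the query reads the result. The function name `gla_step` and the toy dimensions are hypothetical, and a real layer would compute the query, key, value, and gate from the token through learned projections.

```python
def gla_step(state, q, k, v, gate):
    """One recurrent step of a gated linear attention head (illustrative sketch).

    state: d_k x d_v matrix stored as a list of rows
    q, k:  length-d_k vectors; v: length-d_v vector
    gate:  length-d_k vector of values between 0 and 1, derived from the current input
    """
    # Decay each row of the old state by its gate, then write the outer product k v^T.
    new_state = [
        [g_i * s + k_i * v_j for s, v_j in zip(row, v)]
        for row, k_i, g_i in zip(state, k, gate)
    ]
    # Read out: output[j] = sum over i of q[i] * new_state[i][j].
    output = [
        sum(q_i * row[j] for q_i, row in zip(q, new_state))
        for j in range(len(v))
    ]
    return new_state, output


# Toy usage with d_k = 2 and d_v = 2, starting from an empty state.
state = [[0.0, 0.0], [0.0, 0.0]]
state, out1 = gla_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 4.0], gate=[1.0, 1.0])
# A gate of 0.5 on the first row halves what was stored there before the new write.
state, out2 = gla_step(state, q=[1.0, 0.0], k=[0.0, 1.0], v=[1.0, 1.0], gate=[0.5, 1.0])
print(out1, out2)
```

The state keeps the same shape no matter how many tokens pass through it, which is where the constant-space property comes from.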

Why forgetting is the hard part

Standard softmax attention never has to forget, because it re-reads the entire key-value cache at every step: perfect recall, paid for with memory and compute that grow with context length. Linear attention compresses history into a fixed-size state instead, which caps the cost but creates a new problem: a fixed buffer holding an unbounded stream must overwrite something. Ungated variants either keep everything at full weight or decay old information at a fixed rate regardless of content, and both choices hurt: with no decay, stale context crowds out the new; with a fixed decay, important context fades just as fast as filler. Data-dependent gates resolve this by making retention itself a learned, input-conditional decision: keep what still matters, decay what does not. It is the same insight behind the gates in LSTMs decades ago, rebuilt for modern parallel hardware, and it is a main reason gated variants tend to recall specific facts from long contexts better than their ungated ancestors.
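A little arithmetic makes the difference concrete. In a gated recurrence, the weight an old entry still carries is the product of the gates applied since it was written. The sketch below compares the three regimes over 100 steps; the gate values are hand-picked assumptions for illustration, whereas a trained model would compute them from its input.

```python
from math import prod


def retained_weight(gates):
    """Fraction of an old write that survives after the given per-step gates."""
    return prod(gates)


steps = 100

# No decay: the old entry keeps full weight forever, so nothing ever makes room.
no_decay = retained_weight([1.0] * steps)

# Fixed decay: every entry fades at the same rate, important or not.
fixed_decay = retained_weight([0.95] * steps)

# Data-dependent gates (hypothetical values): hold the entry almost intact
# while it is still relevant...
held = retained_weight([0.999] * steps)
# ...then cut it sharply with a single low gate, for example at a topic change.
after_reset = held * 0.1

print(f"no decay:         {no_decay:.4f}")
print(f"fixed decay:      {fixed_decay:.4f}")
print(f"gated, held:      {held:.4f}")
print(f"gated, after cut: {after_reset:.4f}")
```

With a fixed decay of 0.95, less than one percent of the entry is left after 100 steps, while the gated version keeps most of it until the model chooses to let go.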

Parallel training, recurrent inference

Like other modern linear-recurrent models, GLA exhibits a sequential-parallel duality. For training, it can be expressed in a parallel or chunked form that processes whole sequences efficiently on GPUs, keeping the matrix-multiply utilization that deep-learning hardware is built around. For deployment, the same weights run in a recurrent form that updates a fixed-size state once per token. That second form is the payoff for self-hosters: constant memory at inference, with no key-value cache growing as the conversation lengthens. A long-context model that would exhaust a consumer GPU's VRAM under standard attention can, in principle, keep streaming when its memory footprint is a fixed set of state matrices. The trade is honest, though: a fixed-size state is a lossy summary, and tasks demanding exact recall of arbitrary distant tokens remain the stronghold of full attention.
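The duality can be checked numerically on a toy example. The sketch below assumes the same style of recurrence as is commonly written for gated linear attention (per-row gates on the state, outer-product writes) and compares it with an attention-style form in which each output is a decay-weighted sum over all earlier tokens. The function names are hypothetical, and the "parallel" version here is a naive reference loop meant to show the equivalence, not the chunked GPU kernel used in practice.

```python
import random


def recurrent_form(qs, ks, vs, gates):
    """Inference view: carry one d_k x d_v state and update it once per token."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        # Decay each row by its gate, then add the outer product k v^T.
        state = [
            [g_i * s + k_i * v_j for s, v_j in zip(row, v)]
            for row, k_i, g_i in zip(state, k, g)
        ]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def parallel_form(qs, ks, vs, gates):
    """Training view: each output is a weighted sum over all tokens up to t.

    The outer loop over t has no dependency between iterations, which is what
    lets real implementations turn it into batched matrix multiplies.
    """
    T, d_k, d_v = len(qs), len(ks[0]), len(vs[0])
    outputs = []
    for t in range(T):
        out = [0.0] * d_v
        for s in range(t + 1):
            score = 0.0
            for i in range(d_k):
                # Decay accumulated on channel i between the write (s) and the read (t).
                decay = 1.0
                for r in range(s + 1, t + 1):
                    decay *= gates[r][i]
                score += qs[t][i] * decay * ks[s][i]
            for j in range(d_v):
                out[j] += score * vs[s][j]
        outputs.append(out)
    return outputs


random.seed(0)
T, d_k, d_v = 6, 3, 2
qs = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(T)]
ks = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(T)]
vs = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(T)]
gates = [[random.uniform(0.5, 1.0) for _ in range(d_k)] for _ in range(T)]

rec = recurrent_form(qs, ks, vs, gates)
par = parallel_form(qs, ks, vs, gates)
max_gap = max(abs(a - b) for r, p in zip(rec, par) for a, b in zip(r, p))
print("largest difference between the two forms:", max_gap)
```

The two functions agree up to floating-point rounding, yet only the recurrent one gets by with a single fixed-size state; the parallel one needs every past key and value at hand, which is acceptable during training but not what you want when serving.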

Where it sits in the efficient-architecture landscape

GLA belongs to the same generation of efficient sequence models as gated linear RNNs, RetNet, and the state-space family, all of which use linear recurrence for cheap scaling. They differ in how the state decays: RetNet applies a fixed, data-independent decay, while GLA and selective state space models make the state update depend on the input for stronger recall. The connections run deep: the selectivity mechanism in selective state space models plays the same role as GLA's gates, and the state space duality results formalize why these architectures keep converging on similar math from different starting points. The shared theme is replacing quadratic attention with a decaying, fixed-size memory you can run economically.

Why it matters for sovereign AI

For anyone running models on hardware they own, this architecture class is strategically important. Cloud providers can hide quadratic attention costs behind bigger clusters; a homelab cannot. Constant-memory inference means context length stops being a VRAM upgrade and becomes something close to a free parameter, with the work per generated token staying flat as well. A local assistant, a log analyzer chewing through months of miner telemetry, or a document-heavy retrieval stack all benefit from models whose serving memory does not balloon with input size. GLA and its relatives are how long-context AI stays within reach of the plebs rather than the exclusive property of hyperscale datacenters.
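The memory argument is easy to put into numbers. The sketch below is a back-of-the-envelope estimate under stated assumptions: a hypothetical 32-layer model with 32 heads of dimension 128, 16-bit values, no grouped-query attention or cache compression, and a recurrent state of one 128 x 128 matrix per head. Real models differ on all of these, and weights and activations are not counted.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values stored for every past token, in every layer and head."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value


def recurrent_state_bytes(layers, heads, d_k, d_v, bytes_per_value=2):
    """One d_k x d_v state matrix per head per layer, independent of context length."""
    return layers * heads * d_k * d_v * bytes_per_value


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical model shape, chosen only for illustration.
LAYERS, HEADS, HEAD_DIM = 32, 32, 128

kv_8k = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, context_len=8_192)
kv_128k = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, context_len=131_072)
state = recurrent_state_bytes(LAYERS, HEADS, HEAD_DIM, HEAD_DIM)

print(f"KV cache at 8k tokens:   {kv_8k / GIB:.1f} GiB")
print(f"KV cache at 128k tokens: {kv_128k / GIB:.1f} GiB")
print(f"Recurrent state:         {state / MIB:.1f} MiB at any context length")
```

Under these assumptions the cache grows by half a mebibyte per token, while the recurrent state stays the same size whether the model has read one page or a year of logs.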

For the broader context, start with linear attention, then compare the gating story here with selective state space models.
