Definition
Gated linear attention (GLA) is a refinement of linear attention that adds data-dependent gating to the model's recurrent state. Plain linear attention accumulates information into a fixed-size state matrix but has no principled way to forget; over long sequences this can blur or saturate the state. GLA introduces gates whose values depend on the input, letting the model decide how much of the existing state to keep and how much new information to write at each step. This gives it a controllable, decaying memory while preserving the linear-time, constant-space inference that makes linear attention attractive.
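To make the gated update concrete, here is a minimal NumPy sketch of one common way to write it: the state S is decayed row by row with a gate alpha in (0, 1), then the outer product of the current key and value is added. The function names (gla_step, gla_recurrent), the per-row gate shape, and the sigmoid gate parameterization are assumptions chosen for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    # Squashes gate pre-activations into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def gla_step(S, q, k, v, alpha):
    """One recurrent step (illustrative formulation).

    S: (d_k, d_v) state; q, k, alpha: (d_k,); v: (d_v,).
    alpha near 1 keeps the old state, alpha near 0 forgets it.
    """
    S = alpha[:, None] * S + np.outer(k, v)  # decay, then write
    o = q @ S                                # read with the query
    return S, o

def gla_recurrent(Q, K, V, A):
    """Run the recurrence over a sequence of length T.

    Q, K, A: (T, d_k); V: (T, d_v). Returns outputs (T, d_v) and final state.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))  # fixed size, independent of T
    out = np.empty((T, d_v))
    for t in range(T):
        S, out[t] = gla_step(S, Q[t], K[t], V[t], A[t])
    return out, S
```

In a real layer the gates would come from the input, for example A = sigmoid(X @ W_alpha) with a hypothetical learned projection W_alpha. Setting every gate to 1 recovers plain linear attention, which never forgets; setting every gate to 0 leaves only the current token in the state.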
Parallel training, recurrent inference
Like other modern linear-recurrent models, GLA has two equivalent forms. For training, a parallel form processes a whole sequence efficiently on a GPU, typically with a chunkwise algorithm or a prefix scan over the recurrence. For deployment, a recurrent form updates one fixed-size state per token. The recurrent form is what gives constant memory at inference: there is no key-value cache that grows with context length, which makes GLA well suited to serving long-context models on bounded hardware.
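The equivalence of the two forms can be checked numerically. The sketch below compares a token-by-token recurrence with a naive parallel form that computes all pairwise decay factors at once from cumulative log-gates. It is a quadratic-time reference for illustration only, not the chunkwise kernel or scan used in practice, and the function names and gate layout (one gate per state row, values strictly between 0 and 1) are assumptions.

```python
import numpy as np

def gla_recurrent(Q, K, V, A):
    # Sequential form: one fixed-size state, updated once per token.
    T, d_k = Q.shape
    S = np.zeros((d_k, V.shape[1]))
    out = np.empty((T, V.shape[1]))
    for t in range(T):
        S = A[t][:, None] * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def gla_parallel(Q, K, V, A):
    # Parallel form: all positions at once, no loop over time.
    T = Q.shape[0]
    B = np.cumsum(np.log(A), axis=0)        # B[t] = sum of log-gates up to t
    diff = B[:, None, :] - B[None, :, :]    # log of the decay from step s to step t
    causal = np.tril(np.ones((T, T), dtype=bool))
    # Mask future positions before exponentiating so they contribute exactly 0.
    decay = np.exp(np.where(causal[:, :, None], diff, -np.inf))
    P = np.einsum("ti,si,tsi->ts", Q, K, decay)  # decayed query-key scores
    return P @ V

rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 5
Q, K = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
A = 1.0 / (1.0 + np.exp(-rng.normal(size=(T, d_k))))  # gates in (0, 1)

same = np.allclose(gla_recurrent(Q, K, V, A), gla_parallel(Q, K, V, A))
```

The parallel version materializes a T x T x d_k tensor, so it is only practical for short sequences; the recurrent version keeps a d_k x d_v state regardless of T, which is the constant-memory property described above.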
Where it sits
GLA belongs to the same generation of efficient architectures as gated linear RNNs, RetNet, and state-space models. These all use a linear recurrence so that training parallelizes and inference stays cheap; several, GLA included, also make the state update data-dependent for stronger recall, whereas RetNet applies a fixed, data-independent decay. The shared theme is replacing quadratic attention with a fixed-size memory, usually decayed or gated, that can be run economically on modest, local hardware.
For the broader context, see linear attention, selective state space, and state space duality.
