Definition
"Attention sink" refers to the empirical observation that the first few tokens of a sequence collect a disproportionately large share of attention weight, even when those tokens carry little semantic meaning. The phenomenon was characterized in the 2023 StreamingLLM work, which traced it to the softmax: because the attention weights for each query must sum to one, the model needs somewhere to put probability mass it does not actually want to spend, and the always-visible initial tokens become that default destination.
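The sum-to-one constraint can be seen in a minimal sketch. The scores below are made-up illustrative values, not measurements from any model: index 0 stands in for a sink token, and the remaining keys are weak matches for the query.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one query over five keys.
# Index 0 plays the role of the sink; no other key is a strong match.
scores = [4.0, 0.1, -0.2, 0.0, 0.3]
weights = softmax(scores)

# Even when every score is equally poor, the weights cannot all be zero:
# softmax still distributes a total of 1.0 across the keys.
uniform_weights = softmax([-10.0] * 5)
```

In this toy case `weights[0]` takes most of the mass, and `uniform_weights` still sums to one even though no key is a good match. That is the sense in which the query has to spend its attention somewhere.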
Why it happens
In autoregressive models every later token can attend back to the earliest tokens, so during training those positions are the natural candidates to become a stable, always-available sink. Removing them, as happens when a fixed window naively slides forward and evicts the oldest tokens, destabilizes the attention distribution and causes generation quality to collapse.
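To make the visibility argument concrete, the following sketch counts how many query positions can see each key position under a causal mask. The function name `visibility_counts` is a hypothetical helper introduced here for illustration only.

```python
def visibility_counts(seq_len):
    """For each key position k, count the query positions q that may attend
    to it under a causal mask (q can see k only when k <= q)."""
    return [
        sum(1 for q in range(seq_len) if k <= q)
        for k in range(seq_len)
    ]

counts = visibility_counts(6)
```

For a sequence of six tokens, position 0 is visible to all six queries while the last position is visible only to itself. The earliest positions are the only ones every query can rely on finding.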
Practical use
The insight turned into an engineering technique. By keeping the key and value states of just a handful of initial sink tokens pinned in cache alongside a sliding window of recent tokens, a model trained on a finite window can stream over inputs of effectively unbounded length without fine-tuning and without quality collapse. Some newer architectures even add a dedicated learnable sink so the model never has to commandeer a real token for the job. For operators running streaming inference on their own machines, understanding the sink explains why you cannot simply truncate the oldest context for free.
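The eviction policy described above can be sketched as a function that decides which cached positions survive. This is a simplified illustration under stated assumptions, not the StreamingLLM implementation: the name `positions_to_keep` and the default values of `num_sink` and `window` are hypothetical, and the sketch ignores how positional encodings are handled for the retained entries.

```python
def positions_to_keep(seq_len, num_sink=4, window=8):
    """Return the indices of cached tokens retained under a
    sink-plus-sliding-window policy.

    num_sink: number of initial tokens pinned in the cache (assumed value).
    window:   number of most recent tokens kept (assumed value).
    """
    if seq_len <= num_sink + window:
        # Cache is not full yet, so nothing needs to be evicted.
        return list(range(seq_len))
    sinks = list(range(num_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent

def naive_window(seq_len, window=8):
    """Plain sliding window for comparison: the oldest tokens,
    including the sinks, are evicted."""
    return list(range(max(0, seq_len - window), seq_len))

kept = positions_to_keep(20)
naive = naive_window(20)
```

With 20 tokens seen so far, `kept` retains positions 0 to 3 plus the last eight, whereas `naive` drops position 0 entirely. The cache size stays bounded at `num_sink + window` entries no matter how long the stream runs.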
Related: sliding window attention and the KV cache whose eviction policy the sink modifies.
