Attention Sink

Sovereign AI

An attention sink is the empirical observation that the first few tokens of a sequence collect a disproportionately large share of attention weight, even when those tokens carry little semantic meaning. The phenomenon was characterized in the 2023 StreamingLLM work, which traced it to the softmax at the heart of attention: because the attention weights for each query must sum to one across all the positions it can see, the model needs somewhere to put probability mass it does not actually want to spend, and the ever-visible initial tokens become that default destination.

The explanation gives a strange behaviour a simple mechanical cause. Researchers kept finding that transformer models placed very large attention weights on their opening tokens, frequently a lone start-of-text marker with no meaning of its own, and at first this looked like a bug or a training artifact to be removed. The softmax account reframes it as a consequence of the normalization: a query with nothing relevant to attend to still has to distribute a full unit of weight, and the always-visible first tokens are the cheapest place to put it. Seen this way, the sink is the model's built-in release valve rather than a quirk to be fixed, and several long-context engineering choices follow directly from it.

Why it happens

In an autoregressive model, every later token can attend back to the earliest tokens, so those positions are uniquely available throughout the entire sequence. During training the model learns to treat them as a stable, always-present place to park excess attention — a kind of pressure-release valve for the softmax. The tokens themselves need not mean anything; their value is positional, not semantic. This is why the very first token, often a beginning-of-sequence marker, so reliably draws a heavy attention share regardless of what actually follows it in the text.
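The sum-to-one constraint can be seen in a toy calculation. The sketch below uses made-up scores, not values from any real model: one query finds none of its content tokens relevant, first without a sink position and then with one whose score the model has (hypothetically) learned to raise.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores only. Four content tokens, all equally irrelevant.
without_sink = softmax([0.0, 0.0, 0.0, 0.0])

# Same content tokens, plus position 0 acting as a sink with a high score.
with_sink = softmax([4.0, 0.0, 0.0, 0.0, 0.0])

print(without_sink)   # weight is forced evenly onto the content tokens
print(with_sink)      # most of the weight lands on position 0 instead
```

In both cases the weights add up to one. Without the sink, each irrelevant token is still handed a quarter of the attention; with it, the sink absorbs most of the mass and the content tokens are left nearly untouched.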

Why naive truncation breaks

The sink explains a failure that puzzled early long-context experiments. When you slide a fixed window forward and simply evict the oldest tokens to make room, you eventually evict the sink itself. Removing it destabilizes the entire attention distribution — the mass that used to land on the sink now has nowhere sensible to go — and generation quality collapses into repetition or gibberish. You cannot, it turns out, truncate the oldest context for free, and understanding the sink is what tells operators why the obvious memory-saving shortcut backfires so badly.
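A minimal sketch of the failure mode, tracking token positions only rather than real key and value tensors. The function name and parameters are hypothetical, chosen for illustration.

```python
from collections import deque

def fifo_window(num_tokens, window):
    """Positions left in the cache after streaming num_tokens
    through a plain first-in-first-out window of the given size."""
    cache = deque(maxlen=window)   # oldest entry is dropped automatically
    for pos in range(num_tokens):
        cache.append(pos)
    return list(cache)

kept = fifo_window(num_tokens=10, window=4)
print(kept)        # only the most recent positions survive
print(0 in kept)   # position 0, the sink, has been evicted
```

As long as the sequence fits in the window, position 0 is still cached and nothing looks wrong. The sink disappears only once the window fills, which matches the description above of a setup that works at first and then breaks.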

Turning it into a technique

The insight became an engineering method. By keeping the key and value states of just a handful of initial sink tokens (the StreamingLLM paper found four to be enough) pinned in cache alongside a sliding window of recent tokens, a model trained on a finite window can keep generating stably over inputs far longer than that window, with no fine-tuning required. This buys stable streaming rather than a longer memory: the model still sees only the sinks plus the recent window, and anything evicted from the window is gone. Some newer architectures go further and add a dedicated learnable sink, so the model never has to commandeer a real token for the job, freeing that position to carry genuine content.
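The cache policy described here can be sketched as follows. This tracks positions only, not actual tensors, and the names `sink_window`, `num_sinks`, and `window` are illustrative assumptions rather than any library's API.

```python
from collections import deque

def sink_window(num_tokens, num_sinks, window):
    """Positions left in the cache when the first num_sinks tokens
    are pinned and the rest go through a sliding window."""
    sinks = []                      # pinned, never evicted
    recent = deque(maxlen=window)   # rolling window of recent tokens
    for pos in range(num_tokens):
        if len(sinks) < num_sinks:
            sinks.append(pos)
        else:
            recent.append(pos)
    return sinks + list(recent)

kept = sink_window(num_tokens=10, num_sinks=2, window=4)
print(kept)   # sinks first, then the most recent positions

# StreamingLLM assigns positional encodings by slot within the cache,
# not by original text position; sketched here as a simple re-index.
cache_positions = list(range(len(kept)))
print(cache_positions)
```

The cache size is bounded by `num_sinks + window` no matter how long the stream runs, which is what makes the approach practical on fixed memory.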

Why it matters for self-hosting

For someone running streaming inference on their own machine, the sink is one of several reasons the KV cache cannot be managed as a naive first-in-first-out buffer. Any cache-eviction or context-compression scheme you build has to treat those first tokens as special and preserve them, or your carefully engineered long-context setup will quietly fall apart the moment the window fills. It is a small, sharp piece of knowledge that separates a streaming deployment that stays coherent over hours from one that degrades without any obvious cause. The sink pairs naturally with KV cache quantization and position interpolation as part of the toolkit that makes long, local, self-hosted context practical.
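The same rule applies to eviction schemes that are smarter than a sliding window. The sketch below assumes a hypothetical importance score per cached position (for example, accumulated attention weight) and keeps the highest-scoring tokens under a budget, while exempting the first few positions from eviction entirely. All names and numbers are illustrative.

```python
def select_kept(positions, scores, budget, num_sinks=4):
    """Pick which cached positions survive under a size budget.

    positions: cached token positions
    scores:    dict mapping position -> hypothetical importance score
    budget:    maximum number of positions to keep
    num_sinks: leading positions that are never evicted
    """
    pinned = [p for p in positions if p < num_sinks]
    candidates = [p for p in positions if p >= num_sinks]
    room = max(budget - len(pinned), 0)
    # Highest score first; ties go to the more recent position.
    ranked = sorted(candidates, key=lambda p: (scores[p], p), reverse=True)
    return sorted(pinned + ranked[:room])

# The sinks score lowest here, yet they must still be kept.
scores = {0: 0.0, 1: 0.0, 2: 0.9, 3: 0.1, 4: 0.5, 5: 0.2, 6: 0.8, 7: 0.7}
kept = select_kept(list(range(8)), scores, budget=5, num_sinks=2)
print(kept)
```

The pinning is done before ranking on purpose. A scheme that ranks every position by a content-based score can place the sinks last, since they carry no meaning, and evict exactly the tokens the model depends on.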

