Attention Sink

Sovereign AI

Definition

An attention sink is the empirical observation that the first few tokens of a sequence collect a disproportionately large share of attention weight, even when those tokens carry little semantic meaning. The phenomenon was characterized in the 2023 StreamingLLM work, which traced it to the softmax: because the attention weights for each query must sum to one, the model needs somewhere to put probability mass that no key really deserves, and the always-visible initial tokens become that default destination.
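The sum-to-one constraint can be seen in a few lines of Python. This is a minimal sketch using only the standard library; the score values are made up for illustration and do not come from any real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one query over five keys. Every key is an
# equally poor match, yet the weights are still forced to add up to 1.
weak_scores = [-4.0, -4.0, -4.0, -4.0, -4.0]
weak_weights = softmax(weak_scores)

# Same query, but position 0 has been given a high score (illustrative
# values). Almost all of the mass now lands on that first token.
sink_scores = [2.0, -4.0, -4.0, -4.0, -4.0]
sink_weights = softmax(sink_scores)

print("no sink:  ", [round(w, 3) for w in weak_weights])
print("with sink:", [round(w, 3) for w in sink_weights])
```

In the first case the mass is spread evenly across keys the query does not care about. In the second, the first position absorbs nearly all of it and the remaining keys are left almost untouched, which is the behavior the definition describes.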

Why it happens

In autoregressive models, causal masking lets every later token attend back to the earliest tokens, so during training those positions are the natural candidates to become a stable, always-available sink. Removing them, as happens when a fixed window naively slides forward and evicts the oldest tokens, destabilizes the attention distribution and causes generation quality to collapse.

Practical use

The insight turned into an engineering technique. By keeping the key and value states of just a handful of initial sink tokens pinned in the cache alongside a sliding window of recent tokens, a model trained on a finite window can stream over inputs of effectively unbounded length without fine-tuning and without quality collapse. Some newer architectures even add a dedicated learnable sink so the model never has to commandeer a real token for the job. For operators running streaming inference on their own machines, the sink explains why the oldest context cannot simply be truncated for free.
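The eviction policy described above can be sketched as a small cache class. Everything here is hypothetical and for illustration only: the class name `SinkWindowCache`, the default of 4 sink entries, the window of 8, and the string placeholders standing in for real key and value tensors. It leaves out details such as how positional encodings are handled for the retained entries.

```python
from collections import deque

class SinkWindowCache:
    """Toy KV cache: pins the first `num_sinks` entries and keeps a
    sliding window of the most recent `window` entries."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                     # never evicted
        self.recent = deque(maxlen=window)  # oldest entry drops off when full

    def append(self, position, kv):
        entry = (position, kv)
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def positions(self):
        """Original token positions currently held in the cache."""
        return [p for p, _ in self.sinks] + [p for p, _ in self.recent]

# Stream 100 tokens through a cache that can hold only 12 entries.
cache = SinkWindowCache(num_sinks=4, window=8)
for pos in range(100):
    cache.append(pos, kv=f"kv{pos}")  # placeholder for real K/V states

kept = cache.positions()
print(kept)
```

Memory stays constant no matter how long the stream runs, because only the pinned entries and the recent window are ever stored. A plain sliding window would be the same structure with `num_sinks=0`, which is exactly the configuration that evicts the sink.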

Related: sliding window attention and the KV cache whose eviction policy the sink modifies.
