Definition
Sliding window attention is an attention mechanism that restricts each token to attend only to a fixed number of nearby tokens — a window of size w — instead of every token in the sequence. Standard self-attention compares every token against every other token, costing O(n²) in sequence length and making long contexts expensive in both compute and memory. Sliding window attention reduces this to roughly O(n·w), which is linear in sequence length for a fixed window, making long-context models far cheaper to run.
How distant information still flows
A common worry is that a small window blocks long-range understanding. In practice, stacked transformer layers solve this: a token attends to its window in one layer, and those neighbors already absorbed their windows in the layer below. After k layers with window w, information can propagate up to about k·w tokens — so a model with a 4,096-token window and many layers reaches an effective span well over 100,000 tokens. Mistral 7B popularized this design.
Memory benefits for self-hosting
A fixed window also bounds the key-value (KV) cache: only the last w tokens' states need to be kept, often via a rotating buffer, which can halve cache memory on long sequences without hurting quality. For a sovereign Bitcoiner running models on limited GPU memory, that smaller, predictable footprint is the practical payoff.
Sliding window attention is one member of the broader sparse attention family. See also the decode phase, where KV cache size dominates.
In Simple Terms
Sliding window attention is an attention mechanism that restricts each token to attend only to a fixed number of nearby tokens — a window of…
