Definition
Sparse attention is a family of techniques that approximate full self-attention by letting each token attend to only a chosen subset of other tokens instead of the entire sequence. Standard attention is dense — every token attends to every other token — which costs O(n²) in sequence length and becomes prohibitive for long documents. Sparse attention replaces that dense pattern with a structured one, cutting cost toward O(n) while aiming to preserve most of the model's expressive power.
Common sparse patterns
Practical designs mix a few building blocks. Local (window) attention connects each token to its near neighbors. Global attention lets a handful of special tokens attend to, and be attended by, the whole sequence — useful for summary or task tokens. Random attention adds a few random connections to shorten the path between distant tokens. Longformer combines local and global attention to reach linear scaling; BigBird adds random links and proved theoretically that such sparse patterns can match full attention's capabilities.
Why it matters for self-hosting
Sparse attention is one of the main reasons models can handle long contexts on modest hardware. By touching fewer token pairs, it shrinks both the compute of the prefill phase and the memory footprint of the key-value cache. For a sovereign Bitcoiner feeding long documents or transcripts into a locally run model, sparse attention is often what makes a large context window fit in available GPU memory.
Sliding window attention is one widely used form of sparse attention. See also the prefill phase it accelerates.
In Simple Terms
Sparse attention is a family of techniques that approximate full self-attention by letting each token attend to only a chosen subset of other tokens instead…
