Flash Attention

Flash Attention is a fast, memory-efficient algorithm for computing the attention operation at the heart of every transformer model. It produces the exact same result as standard attention — it is not an approximation — but it is engineered to be IO-aware, meaning it minimizes the slow reads and writes between the GPU's large high-bandwidth memory (HBM) and its small but very fast on-chip SRAM. The insight behind it is that attention on modern GPUs is bottlenecked by memory traffic, not arithmetic: the chip can multiply matrices faster than it can shuttle them in and out of memory, so the winning move is to reorganize the computation around moving less data.

How it works

Standard attention computes a full N×N matrix of scores between every pair of positions in the sequence, writes it to GPU memory, reads it back to apply softmax, and writes the result out again before multiplying by the values, so memory traffic and storage grow with the square of the sequence length. Flash Attention instead tiles the computation: it processes the query, key, and value matrices in small blocks that fit in on-chip SRAM, and uses an online-softmax trick (tracking a running maximum and a running sum, and rescaling earlier partial results whenever the maximum changes) to accumulate correctly normalized results incrementally across blocks. The full attention matrix is never materialized in global memory at all. The result is exact attention with memory that scales linearly with sequence length, and substantial wall-clock speedups: the original paper reported up to roughly 7.6x on the attention computation in GPT-2, with later versions (Flash Attention 2 and 3) squeezing out more of the GPU's theoretical throughput through better parallelism and hardware-specific tuning.
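
The online-softmax idea is easier to trust once you see it match the naive result. The sketch below is a minimal pure-Python illustration, not the real GPU kernel: it handles one query row at a time, uses plain lists instead of SRAM tiles, and the names `naive_attention`, `tiled_attention`, and `block_size` are made up for this example. It only shows that walking the keys and values block by block, while keeping a running maximum, a running denominator, and a running output, reproduces standard attention without ever holding a full row of scores.

```python
import math

def naive_attention(Q, K, V):
    """Reference version: builds every score for a query before normalizing."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)                              # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Illustrative tiled version: visits K/V in blocks with an online softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf        # running maximum of the scores seen so far
        l = 0.0              # running softmax denominator
        acc = [0.0] * dv     # running unnormalized output
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            # Rescale what was accumulated under the old maximum.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(m - m_new)
            weights = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])             # normalize once at the end
    return out
```

Whatever `block_size` you pick, the two functions should agree to within floating-point rounding. The real kernels apply the same bookkeeping to tiles of queries as well as keys and values, and keep those tiles in on-chip memory.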

Why it matters

The quadratic memory cost of naive attention was the practical wall that kept context lengths short: doubling the sequence quadrupled the attention matrix. Removing that blow-up is a major reason modern models offer the long context windows they do, and why a given GPU can serve far longer prompts than older software allowed. For a self-hoster, the effect is direct: more of your VRAM goes to model weights and KV cache instead of attention scratch space, so the same card handles longer documents, bigger RAG contexts, and faster prompt processing. Long-context local AI is, to a meaningful degree, downstream of this one algorithm.
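
To put rough numbers on that blow-up, here is a back-of-the-envelope sketch. It assumes 2-byte (fp16) elements and counts a single attention head for a single sequence; the function names and the default head dimension of 128 are illustrative choices, not figures from any particular model.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Size of one materialized N x N score matrix (one head, one sequence)."""
    return seq_len * seq_len * bytes_per_element

def output_bytes(seq_len, head_dim=128, bytes_per_element=2):
    """Size of the N x d attention output, which grows only linearly with N."""
    return seq_len * head_dim * bytes_per_element

for n in (2048, 4096, 8192):
    mib = score_matrix_bytes(n) / 2**20
    print(f"N={n}: {mib:.0f} MiB per head")
# N=2048: 8 MiB per head
# N=4096: 32 MiB per head
# N=8192: 128 MiB per head
```

Each doubling of N quadruples the score matrix, while the output that Flash Attention actually needs to write only doubles. Multiply the per-head figure by the number of heads and the batch size to see why naive attention runs out of room so quickly.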

Where you encounter it

Almost never as a knob you turn. Modern inference engines and training frameworks enable Flash Attention or a successor automatically when supported hardware is present, and runtimes like llama.cpp implement the same IO-aware ideas adapted to consumer GPUs and CPUs. You will occasionally see it as a build flag or a log line at startup; its absence on older hardware is one reason identical models run disproportionately slower there. It is one of several low-level tricks that make local inference viable — see also continuous batching for the serving-side counterpart.

The takeaway

Flash Attention is a reminder that in modern AI, the bottleneck is usually memory movement rather than raw compute — the same lesson ASIC designers learned about feeding hash cores. Nothing about the transformer's math changed; someone simply looked at where the data actually travels and rearranged the work. The payoff, for anyone running models on their own hardware, is longer contexts and faster tokens from silicon you already own.
