Definition
Flash Attention is a fast, memory-efficient algorithm for computing the attention operation at the heart of every transformer model. It produces the exact same result as standard attention — it is not an approximation — but it is engineered to be IO-aware, meaning it minimises the slow reads and writes between the GPU's large high-bandwidth memory (HBM) and its small but fast on-chip SRAM. By tiling the computation and using an online-softmax trick, it never materialises the full attention matrix in global memory, avoiding the quadratic memory blow-up that makes long sequences expensive.
Why it matters
Standard attention is slow and memory-hungry because both its time and memory cost grow with the square of the sequence length. Flash Attention runs substantially faster (up to ~7.6x on GPT-2 in the original paper) and uses memory that scales linearly with sequence length, making longer context windows practical on the same hardware. For a self-hoster, that translates directly into being able to feed more text to a local model without running out of VRAM.
Where you encounter it
Most modern inference engines and training frameworks enable Flash Attention (or its successors) automatically when supported hardware is present. You rarely configure it by hand, but its presence is a major reason a given GPU can handle a larger context window than older software allowed.
It is one of several low-level tricks that make local inference viable; see also continuous batching for throughput on the serving side.
In Simple Terms
Flash Attention is a fast, memory-efficient algorithm for computing the attention operation at the heart of every transformer model. It produces the exact same result…
