Definition
The KV cache, short for key-value cache, is the working memory that makes autoregressive text generation practical. During inference, a transformer's self-attention layers compute key (K) and value (V) tensors for every token. Rather than recomputing those tensors for the entire sequence each time a new token is produced, the model stores them in the KV cache and reuses them, appending only the new token's key and value as it goes.
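The mechanism can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any particular inference engine implements it: it models one attention head in one layer, and the names `KVCache`, `decode_step_cached`, and `decode_step_uncached` are hypothetical. The final loop checks that the cached path gives the same attention output as recomputing every key and value from scratch.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Hypothetical toy cache: one stored key and value per past token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step_cached(x, Wq, Wk, Wv, cache):
    """Project only the newest token x, then attend over the cached K and V."""
    cache.append(matvec(Wk, x), matvec(Wv, x))
    return attend(matvec(Wq, x), cache.keys, cache.values)

def decode_step_uncached(xs, Wq, Wk, Wv):
    """Recompute K and V for every token so far, then attend from the last one."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

random.seed(0)
d = 4  # toy embedding size (assumption)

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
for t in range(len(tokens)):
    cached = decode_step_cached(tokens[t], Wq, Wk, Wv, cache)
    full = decode_step_uncached(tokens[: t + 1], Wq, Wk, Wv)
    # Both paths should produce the same attention output for the newest token.
    assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
               for a, b in zip(cached, full))
```

The cached step touches the projection matrices once per new token, while the uncached step touches them once per token in the whole sequence.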
Why It Matters for Speed
Without a KV cache, generating each new token would require re-running attention over the whole sequence so far, an O(n^2) cost per token for a sequence of n tokens, which grows quickly as the text gets longer. With the cache, only the new token's query is compared against the stored keys, so per-token attention work drops to roughly O(n). The speedup therefore widens as the sequence grows, and it is one of the most important optimizations separating a usable local chatbot from one that crawls.
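To make the gap concrete, here is a rough operation count. It rests on a deliberately simplified cost model (an assumption): it counts only key/value projections and query-key dot products for one attention head in one layer, and the function name `per_token_work` is hypothetical.

```python
def per_token_work(t, use_cache):
    """Return (kv_projections, score_pairs) needed to produce token number t.

    Simplified cost model (an assumption): one attention head, one layer,
    counting only K/V projections and query-key dot products.
    """
    if use_cache:
        # Project only the new token; compare its query with the t stored keys.
        return 1, t
    # Re-project all t tokens and redo causal attention for every position.
    return t, t * (t + 1) // 2

for t in (10, 100, 1000):
    print(t, per_token_work(t, use_cache=False), per_token_work(t, use_cache=True))
# Last line printed: 1000 (1000, 500500) (1, 1000)
```

At the 1000th token, the uncached path in this model performs about 500 times as many dot products as the cached one, and the ratio keeps growing with length.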
The Memory Cost
That speed is bought with memory. The KV cache grows linearly with sequence length, the number of layers, and the model's hidden size (more precisely, the number of key-value heads times the per-head dimension), as well as with the number of sequences being served at once. A long conversation can therefore consume gigabytes of RAM or VRAM on its own, separate from the model weights. This is why long context windows are expensive to serve and why running big models locally often hits a memory wall before a compute wall. Techniques like cache quantization and paged attention exist specifically to tame it.
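A back-of-the-envelope estimate shows how quickly this adds up. The sketch below assumes the cache stores one key and one value vector per token, per layer, per key-value head. The function `kv_cache_bytes` and the configuration numbers are illustrative assumptions, not measurements of any specific model. For standard multi-head attention, `num_kv_heads * head_dim` equals the hidden size.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_element=2, batch_size=1):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer. bytes_per_element=2 corresponds to 16-bit floats.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

GIB = 1024 ** 3

# Illustrative configuration (an assumption, not a specific model).
full = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=32, head_dim=128)
print(full / GIB)       # 2.0

# Same shape with 8-bit cache entries, as a cache-quantization scheme might use.
quantized = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=32,
                           head_dim=128, bytes_per_element=1)
print(quantized / GIB)  # 1.0
```

Doubling the context length doubles the estimate, which is the linear growth described above.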
Understanding the KV cache explains why your local model slows down and eats memory as conversations lengthen. It pairs directly with our entries on Tokens per Second and Local LLM.
