KV Cache (Key-Value Cache)

The KV cache, short for key-value cache, is the working memory that makes autoregressive text generation practical. During inference, a transformer's self-attention layers compute key (K) and value (V) tensors for every token. Rather than recomputing those tensors for the entire sequence each time a new token is produced, the model stores them in the KV cache and reuses them, appending only the new token's key and value as it goes.
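The mechanism can be sketched in a few lines of plain Python. This is a minimal single-head toy, not how any particular runtime implements it; the names `KVCache`, `decode_step`, and `recompute_step`, and the 2-dimensional identity weights, are assumptions made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def attend(query, keys, values):
    """Softmax-weighted average of the values for one query vector."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Hypothetical cache: one list of keys and one of values, append-only."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def decode_step(x, w_q, w_k, w_v, cache):
    """Project only the new token, append its K and V, attend over the cache."""
    query = matvec(w_q, x)
    cache.append(matvec(w_k, x), matvec(w_v, x))
    return attend(query, cache.keys, cache.values)

def recompute_step(xs, w_q, w_k, w_v):
    """Same result without a cache: rebuild K and V for every token so far."""
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    return attend(matvec(w_q, xs[-1]), keys, values)

# Toy inputs (assumed): identity projections and three 2-dimensional tokens.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [decode_step(x, IDENTITY, IDENTITY, IDENTITY, cache)
                  for x in tokens]
```

Both paths produce the same attention output at every step; the cached path simply skips re-projecting tokens it has already seen.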

Why It Matters for Speed

Without a KV cache, generating each new token would require recomputing keys, values, and attention for the whole sequence so far, an O(n²) cost per token that grows brutally as the text gets longer. The cache reduces per-token attention work to roughly O(n), which commonly makes generation several times faster, with the gap widening as the context grows. This is the single most important optimization separating a usable local chatbot from one that crawls.
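A rough way to see the gap is to count operations. The sketch below tallies K/V projections and attention scores for a single layer and head under causal attention; it ignores everything else in the model, so treat it as an illustration of scaling rather than a speed prediction. The function names are hypothetical.

```python
def work_without_cache(n_tokens):
    """Count work when every step reprocesses the whole prefix."""
    projections = 0
    scores = 0
    for t in range(1, n_tokens + 1):
        projections += t            # recompute K and V for all t tokens
        scores += t * (t + 1) // 2  # causal attention over all t positions
    return projections, scores

def work_with_cache(n_tokens):
    """Count work when earlier keys and values are reused from the cache."""
    projections = 0
    scores = 0
    for t in range(1, n_tokens + 1):
        projections += 1  # only the new token is projected
        scores += t       # the new query is scored against t cached keys
    return projections, scores

for n in (10, 100, 1000):
    _, uncached = work_without_cache(n)
    _, cached = work_with_cache(n)
    print(f"{n} tokens: {uncached} scores uncached, {cached} cached")
```

The ratio between the two counts grows with sequence length, which is why the benefit is most visible in long conversations.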

The Memory Cost

That speed comes from spending memory. The KV cache grows linearly with sequence length, the number of layers, and the width of the stored keys and values (KV heads times head dimension, which equals the hidden size under classic multi-head attention), so a long conversation can consume gigabytes of RAM or VRAM on its own, separate from the model weights. This is why long context windows are expensive to serve and why running big models locally often hits a memory wall before a compute wall.

Doing the Math

The cache size is predictable: for every token, each layer stores one key vector and one value vector per KV head. Multiply tokens × layers × KV heads × head dimension × 2 (for K and V) × bytes per element, and you have the cache footprint. The architectural lever hiding in that formula is the number of KV heads. Classic multi-head attention gives every attention head its own K and V; grouped-query attention (GQA) lets groups of query heads share a single KV head, cutting the cache by the grouping factor with minimal quality loss. This is why two models of similar parameter count can have wildly different long-context memory appetites — the KV head count on the model card matters as much as the size on disk.
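The formula translates directly into a small calculator. The configuration below (32 layers, head dimension 128, 16-bit elements, 8,192 tokens) is an assumed example chosen for round numbers, not a figure taken from a specific model card; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_element=2):
    """tokens x layers x KV heads x head dimension x 2 (K and V) x bytes."""
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_element

GIB = 1024 ** 3

# Assumed example: the same model shape with 32 KV heads (multi-head
# attention) versus 8 KV heads (grouped-query attention, grouping factor 4).
mha = kv_cache_bytes(tokens=8192, layers=32, kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)

print(f"MHA: {mha / GIB:.1f} GiB, GQA: {gqa / GIB:.1f} GiB")
# prints "MHA: 4.0 GiB, GQA: 1.0 GiB"
```

Under these assumptions, sharing each KV head across four query heads cuts the cache from 4 GiB to 1 GiB at the same context length.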

Taming the Cache

Because the cache is the marginal cost of context, most long-context engineering attacks it directly. Cache quantization stores K and V at 8-bit or lower precision instead of 16-bit, roughly halving the footprint at 8-bit and shrinking it further at lower precisions, with modest quality impact; many local runtimes expose this as a flag. Paged attention, popularized by server-grade inference engines, allocates the cache in small pages rather than one contiguous block, largely eliminating fragmentation and letting multiple conversations share memory efficiently. Prompt caching reuses the computed cache for a shared prefix, such as a long system prompt or a document under discussion, so it is processed once rather than per request.

On your own hardware, the practical symptoms are easy to read: if generation slows and memory climbs as a chat gets longer, that is the cache growing; if a model that loads fine crashes on a long document, the weights fit but the cache did not. Budget for both.
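The paging idea can be illustrated with a toy allocator. This is a sketch of the bookkeeping only: `PagedKVAllocator`, its page size of 16 tokens, and the sequence names are hypothetical, and it is not the API of any real inference engine.

```python
class PagedKVAllocator:
    """Toy allocator: hands out fixed-size pages from a shared pool on demand."""

    def __init__(self, total_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free_pages = list(range(total_pages))
        self.page_table = {}  # sequence id -> list of page ids
        self.lengths = {}     # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new page only when needed."""
        pages = self.page_table.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(pages) * self.page_tokens:  # existing pages are full
            if not self.free_pages:
                raise MemoryError("KV cache pool exhausted")
            pages.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's pages to the shared pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = PagedKVAllocator(total_pages=4, page_tokens=16)
for _ in range(20):
    pool.append_token("chat-a")  # 20 tokens need 2 pages
for _ in range(5):
    pool.append_token("chat-b")  # 5 tokens need 1 page
pages_in_use = {seq: len(p) for seq, p in pool.page_table.items()}
pool.release("chat-a")           # both of chat-a's pages become reusable
```

Because each conversation holds only the pages it has filled, at most one partly empty page is wasted per sequence, and memory freed by one chat is immediately usable by another.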

Why Sovereign Operators Should Care

When you run models on hardware you own, the KV cache is the difference between the context window a model advertises and the context you can actually afford. A 128K-token window is only real if you have the memory to hold 128K tokens of cache alongside the weights. Understanding the formula lets you size a machine deliberately instead of discovering the wall by crashing into it. It also explains a counterintuitive buying rule: for long-document and long-conversation work, total memory and its bandwidth matter more than raw compute, since the cache — not the arithmetic — is what runs out first.
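Turning the formula around gives the context you can actually afford on a given machine. The numbers below (24 GiB of memory, 16 GiB of weights, 32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions picked for illustration, and `max_context_tokens` is a hypothetical helper that ignores activation buffers and runtime overhead unless you pass them in.

```python
def max_context_tokens(memory_bytes, weight_bytes, layers, kv_heads, head_dim,
                       bytes_per_element=2, overhead_bytes=0):
    """Tokens of KV cache that fit in the memory left after the weights."""
    per_token = layers * kv_heads * head_dim * 2 * bytes_per_element
    available = memory_bytes - weight_bytes - overhead_bytes
    return max(0, available // per_token)

GIB = 1024 ** 3

# Assumed machine and model shape, for illustration only.
affordable = max_context_tokens(
    memory_bytes=24 * GIB,
    weight_bytes=16 * GIB,
    layers=32,
    kv_heads=8,
    head_dim=128,
)
print(f"Affordable context: {affordable} tokens")
# prints "Affordable context: 65536 tokens"
```

Under these assumptions a model advertising a 128K window would only have room for about half of it, and halving the bytes per element through cache quantization would close the gap.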

Understanding the KV cache explains why your local model slows down and eats memory as conversations lengthen. It pairs directly with our entries on Tokens per Second and Local LLM.
