Prefix Caching


Prefix caching is an inference optimization that reuses computation across separate requests whenever they begin with the same tokens. Many real workloads send prompts that share a long identical opening: a fixed system prompt, a reference document, or a block of few-shot examples. Without prefix caching, the model recomputes the attention state for that shared opening on every single request, wasting GPU cycles on work it has already done and forcing every user to pay again for the same fixed preamble.

From per-request to cross-request reuse

An ordinary key-value (KV) cache stores the attention state within one generation, so the model never recomputes the keys and values of its own earlier tokens. Prefix caching, also marketed as prompt caching or context caching, extends that idea across requests. After the first request computes the key and value blocks for a shared prefix, those blocks are kept and handed to any later request whose prompt starts the same way; the new request then computes only the tokens that differ. Serving engines detect the overlap automatically: vLLM hashes fixed-size blocks of tokens, while SGLang organizes prefixes in a radix tree to find the longest match. The reuse is therefore transparent to the caller and needs no change to the request itself, although the match must be exact from the first token, so the order of material in the prompt still matters.
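The block-hashing approach can be illustrated with a toy lookup. This is a minimal sketch under stated assumptions, not the code of any particular engine: `PrefixCache`, `block_hashes`, and the tiny `BLOCK_SIZE` are hypothetical names chosen for illustration, and a string placeholder stands in for the real key and value tensors. Each block's hash includes the hash of the block before it, so a block only matches when everything ahead of it matches too.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # tokens per block; illustrative only, real engines use larger blocks


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block chained with the hash of the preceding block."""
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy cache mapping chained block hashes to placeholder KV blocks."""

    def __init__(self) -> None:
        self.blocks: Dict[str, str] = {}

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens are covered by cached blocks."""
        matched = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            matched += BLOCK_SIZE
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record every full block of this prompt as cached."""
        for i, h in enumerate(block_hashes(tokens)):
            self.blocks.setdefault(h, f"kv-block-{i}")
```

With a ten-token shared opening and a block size of four, a second request that diverges at token ten reuses only the first two blocks (eight tokens). The block that straddles the divergence point is recomputed, which is why reuse happens at block granularity rather than token granularity in this scheme.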

Operational payoff

For chat assistants, retrieval-augmented pipelines, and coding agents that resend a large stable context every turn, prefix caching can sharply reduce time-to-first-token, because the prefill work for the shared prefix is skipped. The speed of generating each output token afterwards is unchanged. The gain grows with how much of the prompt is shared: a long fixed system preamble or a re-sent document is nearly free to process after the first time. The trade-off is memory, since cached prefixes occupy space that could otherwise hold active generations, so engines evict cold prefixes under pressure, typically using a least-recently-used policy. On a memory-constrained home rig, that eviction behaviour is worth understanding, because an over-full cache can quietly start dropping the very prefixes you hoped to reuse.
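The eviction behaviour can be sketched in a few lines. This is a simplified model, not an engine's actual allocator: `LRUPrefixStore` and its methods are hypothetical, and it evicts whole prefixes where real engines generally evict individual blocks.

```python
from collections import OrderedDict
from typing import List


class LRUPrefixStore:
    """Toy store that evicts the least recently used prefix when full."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self.entries: "OrderedDict[str, int]" = OrderedDict()  # prefix id -> blocks held

    def used(self) -> int:
        return sum(self.entries.values())

    def touch(self, prefix_id: str) -> bool:
        """Mark a prefix as recently used; return False on a cache miss."""
        if prefix_id in self.entries:
            self.entries.move_to_end(prefix_id)
            return True
        return False

    def add(self, prefix_id: str, n_blocks: int) -> List[str]:
        """Cache a prefix, evicting cold entries as needed; return what was evicted."""
        evicted: List[str] = []
        if n_blocks > self.capacity:
            return evicted  # too large to cache at all
        if self.touch(prefix_id):
            return evicted  # already cached, just refreshed
        while self.used() + n_blocks > self.capacity:
            oldest, _ = self.entries.popitem(last=False)
            evicted.append(oldest)
        self.entries[prefix_id] = n_blocks
        return evicted
```

In this model, a store of ten blocks holding a six-block system prompt and a four-block document is full. If the system prompt is touched and a new three-block document then arrives, the older document is evicted silently, and the next request that expected to reuse it pays the full prefill cost again.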

The privacy dimension

Cross-request reuse raises an obvious question for anyone running a shared server: could one user's cached prefix leak to another user? A cache hit requires the token sequences to be identical, so reused blocks cannot inject one user's text into another's response. The realistic exposure is a timing side channel: a request that returns unusually fast reveals that someone recently sent the same prefix. Well-built engines guard against this with per-request or per-tenant salting, so a cache entry is only reused within the same logical context and never across trust boundaries. For a sovereign operator hosting a model for a household or a small group, understanding this is part of running the service responsibly: the performance win should never come at the cost of one user learning about another's context.
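Salting can be pictured as mixing a tenant-specific value into the cache key. The sketch below is an illustration only: `prefix_key` and its `salt` parameter are hypothetical and do not describe any engine's actual interface.

```python
import hashlib
from typing import List


def prefix_key(tokens: List[int], salt: str = "") -> str:
    """Derive a cache key from the tokens plus an optional tenant salt."""
    payload = salt + "|" + ",".join(map(str, tokens))
    return hashlib.sha256(payload.encode()).hexdigest()


prompt = [101, 7, 42, 9]

# Same tokens, different salts: the keys differ, so neither tenant can hit the other's entry.
alice_key = prefix_key(prompt, salt="alice")
bob_key = prefix_key(prompt, salt="bob")

# Same tokens, same salt: the key is stable, so reuse still works inside one tenant.
alice_again = prefix_key(prompt, salt="alice")
```

If later blocks are keyed on a hash that includes the block before them, salting only the first block is enough to separate the whole chain. The cost is that tenants no longer share entries for a genuinely common prefix, so each pays for its own first prefill.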

When it helps and when it does not

Prefix caching pays off when prompts share long, stable openings and offers little when every request is unique. Structuring an application to put the fixed material first, with system instructions and reference text ahead of the variable user input, is what lets the cache do its job. Anything that changes near the top of the prompt, such as a timestamp or a per-user greeting, ends the match at that point and forces everything after it to be recomputed. A little prompt discipline turns a nice-to-have into a real speedup.
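The effect of prompt ordering is easy to see by measuring the shared prefix of two prompts directly. In this sketch, words stand in for tokens, and the prompt contents and the `shared_prefix_len` helper are made up for illustration.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count how many leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = ["You", "are", "a", "helpful", "assistant", "."]
DOC = ["Reference", "manual", "section", "one", "."]
q1 = ["What", "is", "a", "KV", "cache", "?"]
q2 = ["How", "does", "eviction", "work", "?"]

# Fixed material first: the whole system prompt and document can be reused.
fixed_first = shared_prefix_len(SYSTEM + DOC + q1, SYSTEM + DOC + q2)

# Variable material first: the prompts diverge at the first token, so nothing is reusable.
variable_first = shared_prefix_len(q1 + SYSTEM + DOC, q2 + SYSTEM + DOC)
```

Both layouts contain exactly the same tokens, yet only the fixed-first layout leaves anything for the cache to match.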

The economics of prefix caching are most striking for agent and retrieval workloads, where the reused prefix can dwarf the fresh input. An assistant that carries a long instruction block, a knowledge document, and a running conversation may resend tens of thousands of identical tokens on every turn while adding only a sentence of new user text. Caching that prefix means the expensive part is computed once and the incremental cost of each turn approaches the cost of just the new tokens. On a single home GPU that is often the difference between an agent that responds promptly across many turns and one that grows sluggish as its context balloons. The design lesson is to treat the stable parts of a prompt as an asset worth positioning first, so the cache can capture as much of the repeated work as possible.
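A rough calculation makes the scale concrete. The figures below are illustrative assumptions, not measurements: a 20,000-token stable prefix, 50 tokens of new user text and a 200-token reply per turn, ten turns, perfect cache hits with no eviction, and block granularity ignored. The hypothetical `prefill_tokens` helper counts how many prompt tokens must be processed in total, conservatively treating each previous reply as needing prefill on the following turn.

```python
def prefill_tokens(prefix: int, user_per_turn: int, reply_per_turn: int,
                   turns: int, cached: bool) -> int:
    """Total prompt tokens processed over a conversation (idealized model)."""
    total = 0
    context = prefix  # tokens already in the conversation before this turn
    for turn in range(turns):
        prompt_len = context + user_per_turn
        if cached and turn > 0:
            # Only the previous reply and the new user text are unseen.
            total += reply_per_turn + user_per_turn
        else:
            # Without a cache, or on the very first turn, the whole prompt is processed.
            total += prompt_len
        context = prompt_len + reply_per_turn
    return total


without_cache = prefill_tokens(20_000, 50, 200, turns=10, cached=False)
with_cache = prefill_tokens(20_000, 50, 200, turns=10, cached=True)
```

Under these assumptions the uncached conversation processes 211,750 prompt tokens and the cached one 22,300, almost all of them on the first turn. Real savings will be smaller whenever entries are evicted or the prefix changes between turns.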

Prefix caching is built directly on the block-based memory model of the KV cache and pairs naturally with continuous batching and chunked prefill to maximize the useful work a single GPU can sustain.

