Definition
Prefix caching is an inference optimization that reuses computation across separate requests whenever they begin with the same tokens. Many real workloads send prompts that share a long identical opening: a fixed system prompt, a reference document, or a block of few-shot examples. Without prefix caching, the model recomputes the attention state for that shared opening on every single request, wasting GPU cycles on work it has already done.
From per-request to cross-request reuse
An ordinary key-value cache stores the attention state within one generation so the model never recomputes its own earlier tokens. Prefix caching, also marketed as prompt caching or context caching, extends that idea across requests. After the first request computes the key/value blocks for a shared prefix, those blocks are kept and handed to any later request whose prompt starts the same way. The new request then computes only the tokens that follow the shared prefix. Serving engines detect the overlap automatically: vLLM does it by hashing fixed-size blocks, while SGLang organizes prefixes in a radix tree to find the longest match.
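The block-hashing idea can be illustrated with a small sketch. It is not vLLM's implementation: the class name `BlockPrefixCache`, the block size of 4 tokens, the use of SHA-256, and the choice to chain each block's hash to the hash of the block before it are all assumptions made for illustration. Chaining is used here so that a block can only match when every block before it also matches.

```python
import hashlib
from typing import Dict, List, Sequence


def block_hashes(tokens: Sequence[int], block_size: int) -> List[str]:
    """Hash every full block of tokens, chaining in the previous block's hash."""
    hashes: List[str] = []
    parent = ""
    # Only full blocks are hashed; a trailing partial block is ignored.
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class BlockPrefixCache:
    """Hypothetical cache mapping block hashes to stand-ins for KV blocks."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.blocks: Dict[str, str] = {}

    def match(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens already have cached blocks."""
        hit = 0
        for h in block_hashes(tokens, self.block_size):
            if h not in self.blocks:
                break
            hit += self.block_size
        return hit

    def insert(self, tokens: Sequence[int]) -> None:
        """Record every full block of this prompt as cached."""
        for h in block_hashes(tokens, self.block_size):
            self.blocks.setdefault(h, "kv-block-placeholder")


shared = list(range(10))               # stands in for a shared system prompt
request_a = shared + [100, 101]
request_b = shared + [200, 201, 202]

cache = BlockPrefixCache(block_size=4)
cold_hit = cache.match(request_a)      # nothing is cached yet
cache.insert(request_a)
warm_hit = cache.match(request_b)      # reuses the blocks shared with request_a
```

In this sketch the two requests share 10 tokens, but only 8 of them are reused, because the third block mixes shared and request-specific tokens and therefore hashes differently. That is the general cost of matching at block granularity: reuse rounds down to a whole number of blocks.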
Operational payoff
For chat assistants, retrieval-augmented pipelines, and coding agents that resend a large stable context every turn, prefix caching can sharply reduce time-to-first-token and cut redundant compute substantially. The trade-off is memory: cached prefixes occupy space that could otherwise hold active generations, so engines evict cold prefixes under memory pressure. In multi-tenant settings, adding a per-request salt to the cache key keeps one user's cached prefix from being reused by another, which protects privacy.
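Eviction and salting can be sketched together in a few lines. The example below is an illustration under stated assumptions, not any engine's actual policy: `LRUPrefixStore` and `prefix_key` are hypothetical names, least-recently-used eviction is one possible policy among several, and the cached attention state is represented by a plain string.

```python
import hashlib
from collections import OrderedDict
from typing import Optional, Sequence


def prefix_key(tokens: Sequence[int], salt: str = "") -> str:
    """Build a cache key; a different salt yields a different key for the same tokens."""
    payload = salt + "|" + ",".join(map(str, tokens))
    return hashlib.sha256(payload.encode()).hexdigest()


class LRUPrefixStore:
    """Hypothetical fixed-capacity store that evicts the least recently used prefix."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)          # mark as recently used
        return self.entries[key]

    def put(self, key: str, kv_state: str) -> None:
        self.entries[key] = kv_state
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # drop the coldest prefix


system_prompt = [1, 2, 3, 4]
key_a = prefix_key(system_prompt, salt="tenant-a")
key_b = prefix_key(system_prompt, salt="tenant-b")
key_other = prefix_key([9, 9, 9], salt="tenant-a")
key_new = prefix_key([7, 7, 7], salt="tenant-a")

store = LRUPrefixStore(capacity=2)
store.put(key_a, "kv-state-a")
cross_tenant = store.get(key_b)    # same tokens, different salt
store.put(key_other, "kv-state-other")
store.get(key_a)                   # touching key_a keeps it warm
store.put(key_new, "kv-state-new") # over capacity: the coldest entry is evicted
```

Here the second tenant misses even though its tokens are identical, because the salt is part of the key. When the store overflows, the prefix that was touched least recently (`key_other`) is the one dropped, while the recently used `key_a` survives.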
Prefix caching is built directly on the block-based memory model of the KV cache and pairs naturally with continuous batching to maximize the work a single GPU can sustain.
