Prefix Caching

Definition

Prefix caching is an inference optimization that reuses computation across separate requests whenever they begin with the same tokens. Many real workloads send prompts that share a long identical opening: a fixed system prompt, a reference document, or a block of few-shot examples. Without prefix caching, the model recomputes the attention state for that shared opening on every single request, wasting GPU cycles on work it has already done.

From per-request to cross-request reuse

An ordinary key-value (KV) cache stores the attention keys and values for tokens already processed within one generation, so the model never recomputes them as it emits each new token. Prefix caching, also marketed as prompt caching or context caching, extends that idea across requests. After the first request computes the key/value blocks for a shared prefix, those blocks are kept and handed to any later request whose prompt starts with the same tokens. The new request then computes only the tokens that follow the shared prefix. Serving engines detect the overlap automatically: vLLM hashes fixed-size blocks of tokens, while SGLang organizes prefixes in a radix tree to find the longest match.
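The block-hashing approach can be illustrated with a minimal sketch. It is not vLLM's actual implementation: the names `block_hashes`, `PrefixCache`, and `BLOCK_SIZE` are hypothetical, the block size of 4 is chosen only for readability, and a string stands in for the real key/value tensors. Each block's hash also covers the hash of the block before it, so a block matches only when everything ahead of it matches too.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_SIZE = 4  # tokens per block; illustrative only, real engines use larger blocks


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every full block, chaining in the previous block's hash."""
    hashes: List[str] = []
    parent = ""
    # Only full blocks are hashed; a trailing partial block is ignored.
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy cache mapping chained block hashes to placeholder KV blocks."""

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}

    def match(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens already have cached blocks."""
        matched = 0
        for h in block_hashes(tokens):
            if h not in self._blocks:
                break
            matched += BLOCK_SIZE
        return matched

    def insert(self, tokens: Sequence[int]) -> None:
        """Record a block for every full block of this prompt."""
        for i, h in enumerate(block_hashes(tokens)):
            self._blocks.setdefault(h, f"kv-block-{i}")  # stand-in for tensors


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = list(range(8))            # shared opening, two full blocks
    first = system_prompt + [100, 101, 102]
    second = system_prompt + [200, 201]

    print(cache.match(first))   # nothing cached yet
    cache.insert(first)
    print(cache.match(second))  # the shared opening is now reusable
```

Chaining the parent hash is what makes a flat hash table behave like a prefix lookup: two prompts that contain the same block at different positions, or after different openings, produce different hashes and never collide.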

Operational payoff

For chat assistants, retrieval-augmented pipelines, and coding agents that resend a large stable context every turn, prefix caching can sharply reduce time-to-first-token and substantially cut redundant prefill compute. The trade-off is memory: cached prefixes occupy GPU memory that could otherwise hold active generations, so engines evict cold prefixes under pressure. In multi-tenant settings, mixing a per-user or per-request salt into the cache key keeps one user's cached prefix from being reused for another user's request, which protects privacy.
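Eviction and salting can be sketched together in a few lines. This is an assumption-laden toy, not any engine's real policy: the class `SaltedLRUPrefixStore` and its methods are hypothetical, eviction is plain least-recently-used, whole prefixes are cached instead of blocks, and strings stand in for key/value tensors.

```python
import hashlib
from collections import OrderedDict
from typing import Optional, Sequence


class SaltedLRUPrefixStore:
    """Toy prefix store with LRU eviction and a salt mixed into each key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity              # max number of cached prefixes
        self._store: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(tokens: Sequence[int], salt: str) -> str:
        # The salt is part of the hashed payload, so identical tokens under
        # different salts produce unrelated cache keys.
        payload = salt + "|" + ",".join(map(str, tokens))
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, tokens: Sequence[int], salt: str = "") -> Optional[str]:
        key = self._key(tokens, salt)
        if key not in self._store:
            return None
        self._store.move_to_end(key)          # mark as most recently used
        return self._store[key]

    def put(self, tokens: Sequence[int], kv: str, salt: str = "") -> None:
        key = self._key(tokens, salt)
        self._store[key] = kv
        self._store.move_to_end(key)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict the coldest prefix


if __name__ == "__main__":
    store = SaltedLRUPrefixStore(capacity=2)
    prompt = [1, 2, 3]
    store.put(prompt, "kv-for-prompt", salt="alice")
    print(store.get(prompt, salt="alice"))    # same salt: cache hit
    print(store.get(prompt, salt="bob"))      # different salt: cache miss
```

The cost of salting is a lower hit rate: two tenants who send the same system prompt each pay for their own copy, which is the price of keeping their caches isolated.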

Prefix caching is built directly on the block-based memory model of the KV cache and pairs naturally with continuous batching to maximize the work a single GPU can sustain.
