Prompt Caching

Sovereign AI

Prompt caching lets an inference engine store a frequently reused prompt prefix — system instructions, tool definitions, long documents, or examples — so that subsequent requests read from the cache instead of reprocessing the same tokens. When a new request shares a cached prefix up to a defined breakpoint, the model reuses the stored computation, cutting both latency and cost on the cached portion. It exploits a simple truth about real workloads: most of what you send an LLM is the same thing you sent it last time.

What is actually being cached

A transformer processes a prompt by computing attention keys and values for every token — the KV cache. That computation is deterministic for a given model and token sequence, so if two requests begin with byte-identical prefixes, the second can load the first's KV state and start computing only at the point where the requests diverge. That is all prompt caching is: persisting and re-attaching KV state across requests. The strict prefix requirement explains the cardinal rule of cache-friendly prompt design — put stable content first, volatile content last. A timestamp or session ID placed at the top of a prompt silently breaks caching for everything after it.

The economics on hosted APIs

On hosted APIs that support it (for example Anthropic's Claude API with the cache_control ephemeral marker), writing a prefix to the cache costs modestly more than normal input tokens, while every subsequent hit within the time-to-live is billed at a small fraction of the standard input price — around a tenth. Default ephemeral caches typically live on the order of minutes, refreshed on use, with longer durations available at a higher write cost; exact pricing and TTLs shift, so check current provider documentation before building a cost model. For workloads with a large static context and a small changing tail — agents re-sending tool definitions every turn, chatbots dragging conversation history — the input-token savings routinely reach 70–90%.

Caching on your own hardware

The same mechanics exist locally, without billing attached. Self-hosted inference servers implement prefix or prompt caching so a local LLM does not recompute attention over tokens it has already seen — some engines cache per session, others maintain shared prefix trees across concurrent requests. On your own GPU the resource being saved is time and power rather than dollars: a multi-thousand-token system prompt that took seconds to prefill returns instantly on the next request, which is the difference between a sluggish assistant and a responsive one. The trade is memory — cached KV state occupies VRAM or system RAM — so long stable prefixes and large context windows compete for the same budget.

Designing for the cache

The pattern to internalize is stable prefix, dynamic suffix. System instructions, persona, tool schemas, and reference documents go first, ordered from least to most frequently changed; the user's query and any freshly retrieved material go last. This pairs naturally with a RAG pipeline, where instructions stay fixed while retrieved context changes per query. For sovereign operators the appeal is doubled: structuring prompts this way is the highest-leverage inference optimisation available without touching model weights, and on self-hosted hardware it happens entirely on machines you control — no data leaves, nobody meters the cache.

Two boundaries keep expectations honest. First, caching is exact-prefix matching on tokens: one changed character invalidates everything after it, so templating discipline — identical whitespace, stable ordering, no embedded timestamps — is what separates a 90% hit rate from a 0% one, and cache-hit metrics are worth logging from day one. Second, a cache stores computation, not meaning: it makes the same prompt cheaper, never a bad prompt better, and it does nothing across different models or quantizations, since KV state is architecture-specific. Within those limits the technique is pure profit. A self-hosted assistant that answers from a large fixed knowledge preamble, an agent that re-sends its tool catalog hundreds of times a day, a batch pipeline sharing one system prompt across thousands of items — each is the shape of workload where prompt caching quietly returns most of the compute you were wasting.

Prompt caching lets an inference engine store a frequently reused prompt prefix — system instructions, tool definitions, long documents, or examples — so that subsequent…

Explore the Full Glossary

Browse all Bitcoin mining terms from A to Z. Whether you are a beginner or expert, deepen your understanding of the mining ecosystem.

Mining Glossary

ASIC Miner Database

Compare 500+ miners with real-time profitability data, home mining scores, and detailed specs.

Compare Miners