Definition
Prompt caching lets an inference engine store a frequently reused prompt prefix (system instructions, tool definitions, long documents, or examples) so that subsequent requests read from the cache instead of reprocessing the same tokens. When a new request shares a cached prefix up to a defined breakpoint, the model reuses the stored computation, reducing both latency and cost on the cached portion.
How the economics work
On hosted APIs that support it (for example Anthropic's Claude API with the cache_control ephemeral marker), writing to the cache costs slightly more than a normal input token, but every later request that hits the cache within the time-to-live pays only a small fraction (around a tenth) of the standard input price. The default ephemeral cache typically lasts about five minutes, with longer durations available at higher write cost. In practice this can cut input token spend by 70-90% for workloads with a large static context and a small changing tail.
Caching in self-hosted inference
The same idea exists locally. A local LLM server reuses the KV cache for a shared prefix so it does not recompute attention over tokens it has already seen this session, which is the on-device equivalent of prompt caching and a key reason long, stable context windows stay affordable.
For sovereign operators running their own inference, structuring prompts as a stable cached prefix plus a small dynamic suffix is the single highest-leverage cost optimisation, and it keeps all data on your own hardware. It pairs naturally with a RAG pipeline where the retrieved context changes but the instructions do not.
In Simple Terms
Prompt caching lets an inference engine store a frequently reused prompt prefix (system instructions, tool definitions, long documents, or examples) so that subsequent requests read…
