Speculative Decoding

Speculative decoding is an inference optimisation that makes large language models generate text faster without changing what they would have produced. A small, fast draft model proposes several candidate tokens ahead; the large target model then checks all of those proposals in a single parallel forward pass and keeps the longest prefix it agrees with, generating the next token itself to stay on track. Because the target model retains the final say on every token, the output distribution is preserved exactly — it is a lossless speed-up, not an approximation, which distinguishes it from techniques like quantization that trade a little fidelity for efficiency.
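The propose-then-verify loop described above can be sketched with toy models. This is a minimal illustration of the greedy variant only, assuming deterministic next-token functions; `target_next` and `draft_next` are hypothetical stand-ins rather than real language models, and a real engine would score all draft positions in one batched forward pass instead of the Python loop used here. Sampling-based decoding uses a probabilistic acceptance rule that this sketch does not cover.

```python
from typing import Callable, List, Tuple

# A "model" here is just a deterministic function: token prefix -> next token.
NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the large target model (hypothetical rule)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    """Toy stand-in for the draft model: usually agrees with the target."""
    guess = target_next(prefix)
    # Deliberately disagree at some positions to exercise the rejection path.
    return (guess + 1) % 50 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target predicts the next token at every draft position.
        #    Counted as one pass, since a real engine batches these together.
        target_passes += 1
        verified = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest prefix on which the draft matches the target.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        # 4. Append the accepted tokens plus the target's own next token.
        out.extend(proposal[:n_ok])
        out.append(verified[n_ok])
    return out[:goal], target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, 20)
tokens, passes = speculative_decode(target_next, draft_next, prompt, 20)
print("identical output:", tokens == baseline)
print("target passes:", passes, "instead of", 20)
```

Because every appended token is either confirmed or produced by the target, the result matches target-only greedy decoding token for token; the saving shows up only in the number of target passes.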

Why it speeds things up

Autoregressive generation is normally bottlenecked by memory bandwidth rather than raw compute: producing one token requires streaming the entire set of model weights through the GPU, and doing that once per token leaves the arithmetic units mostly idle. Verifying a batch of draft tokens costs roughly the same weight-streaming as generating a single token, so if the draft model guesses five tokens correctly, you get those five, plus the target's own next token, for approximately the memory traffic of one target pass (on top of the much smaller cost of running the draft). Reported gains are commonly in the region of 2–3× faster generation, though the figure varies with the model pair, hardware, and workload. The catch is acceptance rate: when the draft model guesses poorly (unfamiliar domains, unusual formatting, creative writing with many plausible continuations), most proposals get rejected, and the wasted draft work can erode or even erase the benefit. Predictable, structured text (code, boilerplate, repeated technical phrasing) speculates best.
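The dependence on acceptance rate can be made concrete with a rough back-of-the-envelope model. The sketch below makes simplifying assumptions that are not from the text above: each draft token is accepted independently with the same probability `alpha`, a verification pass costs the same as a single-token target pass, and `draft_cost` is the cost of one draft step relative to one target pass. Real acceptance is not independent per token, so treat the numbers as illustrative only.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k draft tokens.

    Assumes each draft token is accepted independently with probability
    alpha. The count includes the one token the target always contributes.
    """
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric sum 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speed-up over plain decoding under the stated assumptions.

    One speculative round costs k draft steps plus one target pass,
    measured in units of a single target pass.
    """
    round_cost = k * draft_cost + 1.0
    return expected_tokens_per_pass(alpha, k) / round_cost


for alpha in (0.3, 0.6, 0.8, 0.9):
    for draft_cost in (0.05, 0.2):
        s = estimated_speedup(alpha, k=5, draft_cost=draft_cost)
        print(f"alpha={alpha:.1f} draft_cost={draft_cost:.2f} -> {s:.2f}x")
```

Under these assumptions a high acceptance rate with a cheap draft lands in the multi-fold range, while a low acceptance rate with a relatively expensive draft drops below 1×, which is the "erase the benefit" case.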

Choosing a draft model

The draft model must share the target model's tokenizer and vocabulary, and ideally its training lineage — a tiny sibling from the same model family is the classic pairing. The size gap is the tuning knob: a very small draft is cheap per proposal but wrong more often, while a larger draft guesses better but eats into the savings. Some approaches skip the separate draft model entirely, letting the target model draft with a reduced portion of its own layers, which avoids keeping a second model in memory at all.
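The shared-vocabulary requirement is easy to check before pairing two models. The sketch below assumes each vocabulary has already been loaded into a plain dictionary mapping token strings to integer ids; how you obtain those dictionaries depends on your tooling, and the function name `vocab_mismatches` is purely illustrative.

```python
from typing import Dict, List


def vocab_mismatches(
    target_vocab: Dict[str, int], draft_vocab: Dict[str, int], limit: int = 5
) -> List[str]:
    """Return up to `limit` human-readable differences between two vocabularies.

    An empty list means every token string maps to the same id in both
    and the vocabularies are the same size.
    """
    problems: List[str] = []
    if len(target_vocab) != len(draft_vocab):
        problems.append(
            f"size differs: target={len(target_vocab)} draft={len(draft_vocab)}"
        )
    for token, target_id in target_vocab.items():
        if len(problems) >= limit:
            break
        draft_id = draft_vocab.get(token)
        if draft_id is None:
            problems.append(f"{token!r} missing from draft vocabulary")
        elif draft_id != target_id:
            problems.append(
                f"{token!r} has id {target_id} in target but {draft_id} in draft"
            )
    return problems


# Tiny made-up vocabularies for illustration only.
target = {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}
sibling = {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}
stranger = {"<s>": 0, "</s>": 1, "world": 2, "hello": 3}

print(vocab_mismatches(target, sibling))
print(vocab_mismatches(target, stranger))
```

If the ids disagree, a token the draft proposes means something different to the target, so its proposals cannot be verified position by position.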

Relevance to local inference

For someone running models on their own machine, speculative decoding is one of the few ways to get meaningfully faster responses without buying a bigger GPU — it spends a little extra VRAM on a draft model to buy back wall-clock time. Local inference engines including llama.cpp support it, letting you pair a tiny draft with the main model you actually want answers from. On a self-hosted assistant that streams answers token by token, the difference is immediately visible in how fast text appears. It stacks with, rather than replaces, quantization and efficient attention implementations, since they attack different bottlenecks.

When to enable it

Speculative decoding shines exactly where a self-hoster lives: batch-of-one, latency-sensitive, interactive use. A single user chatting with a local model leaves the GPU starved for work, which is the idle capacity speculation converts into speed. On a heavily loaded shared server the calculus differs — high-throughput serving already keeps the hardware busy with many requests, and draft verification competes for that capacity. The practical advice is to measure rather than assume: enable it, watch the acceptance rate and tokens-per-second your engine reports on your real workloads, and keep it if the numbers improve. Expect the win to vary by task — large for code and structured output, smaller for open-ended prose — and remember the draft model's memory footprint has to earn its keep on a card where every gigabyte is contested.
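The "measure rather than assume" advice can be turned into a small comparison helper. This is a sketch under assumptions: the `RunStats` fields are hypothetical names for numbers you would copy by hand from whatever your engine reports, and the 10% minimum gain is an arbitrary threshold chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Numbers recorded from one benchmark run (field names are illustrative)."""

    generated_tokens: int
    wall_seconds: float
    drafted_tokens: int = 0   # draft tokens proposed (0 for a baseline run)
    accepted_tokens: int = 0  # draft tokens the target accepted

    @property
    def tokens_per_second(self) -> float:
        return self.generated_tokens / self.wall_seconds

    @property
    def acceptance_rate(self) -> float:
        if self.drafted_tokens == 0:
            return 0.0
        return self.accepted_tokens / self.drafted_tokens


def keep_speculation(
    baseline: RunStats, speculative: RunStats, min_gain: float = 1.10
) -> bool:
    """Keep speculation only if throughput improves by at least min_gain."""
    return speculative.tokens_per_second >= min_gain * baseline.tokens_per_second


# Made-up measurements for the same prompt set, with and without a draft model.
baseline = RunStats(generated_tokens=512, wall_seconds=20.0)
speculative = RunStats(
    generated_tokens=512, wall_seconds=9.0, drafted_tokens=400, accepted_tokens=300
)

print(f"baseline:    {baseline.tokens_per_second:.1f} tok/s")
print(f"speculative: {speculative.tokens_per_second:.1f} tok/s "
      f"(acceptance {speculative.acceptance_rate:.0%})")
print("keep speculation:", keep_speculation(baseline, speculative))
```

Running the same comparison separately for each kind of workload (code, structured output, open-ended prose) shows where the draft model's memory is paying for itself.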

Speculative decoding pairs naturally with other serving optimisations such as continuous batching and flash attention. For background on the per-token generation loop it accelerates, see inference.
