Definition
Speculative decoding is an inference optimisation that makes large language models generate text faster without changing what they would have produced. A small, fast draft model proposes several candidate tokens ahead; the large target model then checks all of those proposals in a single parallel forward pass and keeps the longest prefix it agrees with, generating the next token itself to stay on track. Because the target model still has the final say on every token, the output distribution is preserved exactly — it is a lossless speed-up, not an approximation.
Why it speeds things up
Autoregressive generation is normally bottlenecked by memory bandwidth: producing one token at a time leaves the GPU underused. Verifying many draft tokens at once makes far better use of the hardware and cuts inter-token latency, commonly yielding 2-3x speedups. The gain depends heavily on how often the draft model guesses correctly; when speculation accuracy is low, the wasted work on rejected tokens can erode the benefit.
Relevance to local inference
For someone running models on their own machine, speculative decoding is one of the few ways to get meaningfully faster responses without buying a bigger GPU. Several local inference engines support it, often pairing a tiny draft model with the main model you actually want answers from.
It pairs naturally with other serving optimisations such as continuous batching and flash attention. For background on the per-token generation it accelerates, see inference.
In Simple Terms
Speculative decoding is an inference optimisation that makes large language models generate text faster without changing what they would have produced. A small, fast draft…
