Definition
N-gram speculation, also called prompt lookup decoding, is the simplest form of speculative decoding for large language models. Where methods such as EAGLE train a small draft network, n-gram speculation uses no extra model at all. It scans the input prompt and the text generated so far for an earlier occurrence of the most recent few tokens (the n-gram), then proposes the tokens that followed that occurrence as the speculative draft. The full model verifies the whole draft in a single forward pass and keeps the longest prefix that agrees with its own predictions; the rest of the draft is discarded.
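The lookup and verification steps can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular serving engine: the names `propose_draft` and `verify`, the default sizes, and the `next_token` callback standing in for the full model are all assumptions, and verification is shown for greedy decoding only.

```python
from typing import Callable, List, Sequence


def propose_draft(tokens: Sequence[int], ngram_size: int = 3,
                  max_draft: int = 5) -> List[int]:
    """Return the tokens that followed the most recent earlier occurrence
    of the trailing n-gram, or [] if there is no earlier occurrence."""
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan right to left, starting just before the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []


def verify(context: Sequence[int], draft: Sequence[int],
           next_token: Callable[[Sequence[int]], int]) -> List[int]:
    """Greedy verification stand-in (assumption: `next_token` plays the
    role of the full model). A real engine scores every draft position
    in one batched forward pass; this loop calls the model per position
    only to keep the logic readable."""
    accepted: List[int] = []
    for tok in draft:
        expected = next_token(list(context) + accepted)
        if tok != expected:
            # First mismatch: keep the model's own token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Whole draft accepted: the same pass also yields one extra token.
    accepted.append(next_token(list(context) + accepted))
    return accepted


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    print(propose_draft(history))  # tokens that followed the earlier 1, 2, 3
```

Because every accepted token is one the model would have produced anyway under greedy decoding, the output is unchanged; only the number of forward passes drops.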
Where it shines
The technique only helps when the output is likely to repeat substrings of the input or of earlier output. That makes it well suited to summarization, retrieval-augmented generation, code editing, and question answering over a supplied document, where the model frequently quotes the source verbatim. In those input-grounded tasks, speedups of 2x to 4x are commonly reported, with no change to output quality because every accepted token is verified by the full model. On open-ended creative generation, matches are rare and the benefit shrinks toward zero.
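A toy simulation makes the contrast concrete. The sketch below treats a known target string as "what the full model would generate" and counts how many verification passes are needed to emit it; the helper names `lookup` and `count_passes`, the whitespace tokenization, and the example sentences are all made up for illustration. Pass counts are not wall-clock speedups, since a verification pass over a draft costs somewhat more than a single-token step.

```python
from typing import List


def lookup(tokens: List[str], n: int, k: int) -> List[str]:
    """Up to k tokens that followed the latest earlier match of the last n tokens."""
    tail = tokens[-n:]
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []


def count_passes(prompt: List[str], target: List[str],
                 n: int = 2, k: int = 4) -> int:
    """Count forward passes needed to emit `target`, assuming `target`
    is exactly what the full model would produce greedily."""
    done = 0    # number of target tokens emitted so far
    passes = 0
    while done < len(target):
        draft = lookup(prompt + target[:done], n, k)
        # Accept the longest draft prefix that agrees with the target.
        accepted = 0
        while (accepted < len(draft)
               and done + accepted < len(target)
               and draft[accepted] == target[done + accepted]):
            accepted += 1
        # Each pass also yields one token from the model itself.
        done = min(len(target), done + accepted + 1)
        passes += 1
    return passes


prompt = "the cache is flushed every ten minutes by a background worker".split()
grounded = "Summary: the cache is flushed every ten minutes".split()
open_ended = "Summary: caching layers need periodic invalidation to work".split()

print(len(grounded), "tokens in", count_passes(prompt, grounded), "passes")
print(len(open_ended), "tokens in", count_passes(prompt, open_ended), "passes")
```

In this toy run the grounded target, which quotes the prompt, needs half as many passes as it has tokens, while the open-ended target never finds a match and needs one pass per token, the same as ordinary decoding.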
Why self-hosters like it
Because it needs no draft model, n-gram speculation costs nothing extra to deploy, adds negligible memory, and requires no fine-tuning. For someone running inference on their own GPU, it is often the first speculative method to switch on: it is supported out of the box in serving engines like vLLM, and it is hard to get wrong. When no match is found, decoding proceeds exactly as normal; when a draft is rejected, the only cost is the compute spent verifying the discarded tokens.
Compared with the feature-based drafting described in our EAGLE decoding entry, n-gram speculation trades sophistication for simplicity. Both belong to the broader family of serving optimizations that also includes in-flight batching.
