N-Gram Speculation

Sovereign AI

N-gram speculation, also called prompt lookup decoding, is the simplest form of speculative decoding for large language models. Where methods such as EAGLE train a small draft network to guess upcoming tokens, n-gram speculation uses no extra model at all. It scans the input prompt and the text generated so far for an n-gram that matches the most recent tokens, then proposes the tokens that followed that earlier match as the speculative draft. The full model verifies the entire draft in a single forward pass and accepts the longest correct prefix — so output is provably identical to normal decoding, just produced faster.

How the trick works

Autoregressive generation is memory-bandwidth-bound: producing one token requires streaming all the model's weights through the GPU, and producing ten tokens one at a time streams them ten times. Verification flips the economics. Checking ten candidate tokens at once is a single pass — barely more expensive than generating one token — because the weights are read once and applied to all positions in parallel. Speculative decoding exploits this asymmetry: guess cheaply, verify in bulk. N-gram speculation makes the guessing step literally free by using string matching as the draft model. If the last few generated tokens also appear earlier in the context, the continuation that followed them before is proposed as the continuation now. When the model is quoting, copying, or lightly editing its source material, that guess is right remarkably often.

Where it shines and where it stalls

The technique only helps when output is likely to repeat substrings of the input. That makes it well suited to summarization, retrieval-augmented generation, code editing, and question answering over a supplied document — workloads where the model frequently reproduces source text verbatim. In these input-grounded tasks, reported speedups of 2x to 4x are common, with zero change in output quality because verification guarantees exactness. On open-ended creative generation there is nothing to copy from, matches are rare, and the benefit shrinks toward zero — though never below it, since a failed draft simply falls back to ordinary decoding at negligible cost.

Why self-hosters flip it on first

Because it needs no draft model, n-gram speculation costs nothing to deploy: no extra weights in VRAM, no fine-tuning, no compatibility matrix between draft and target models. For someone running inference on their own GPU, it is usually the first speculative method worth enabling — serving engines like vLLM support it out of the box behind a single configuration flag, and it is essentially impossible to get wrong. That profile fits the sovereign-stack workload pattern especially well: local assistants spend most of their time answering questions about supplied documents, summarizing notes, and editing code — precisely the copy-heavy tasks where prompt lookup pays. On a homestead inference box where every watt is accounted for, the same logic that makes a miner chase joules per terahash applies to tokens per joule, and a free 2x on grounded tasks is the cheapest efficiency win available.

Tuning is refreshingly shallow. The two knobs are the n-gram length used for matching and the number of tokens proposed per draft; short matches propose more often but get rejected more, longer matches are rarer but stickier. Defaults are sensible, and because verification guarantees correctness, the worst mis-tuning costs only speed. Measure acceptance rate on your real workload for an afternoon and you know everything the method has to tell you.

N-gram speculation trades sophistication for simplicity compared with the learned, feature-level drafting described in our EAGLE decoding entry — EAGLE reaches higher acceptance rates on general text, but costs a trained draft head and setup effort. Both slot into the broader family of serving-side throughput optimizations covered under in-flight batching. A reasonable local-inference posture: enable prompt lookup everywhere, and graduate to a trained draft model only when profiling shows your workload deserves it.

N-gram speculation, also called prompt lookup decoding, is the simplest form of speculative decoding for large language models. Where methods such as EAGLE train a…

Explore the Full Glossary

Browse all Bitcoin mining terms from A to Z. Whether you are a beginner or expert, deepen your understanding of the mining ecosystem.

Mining Glossary

ASIC Miner Database

Compare 500+ miners with real-time profitability data, home mining scores, and detailed specs.

Compare Miners