Definition
Lookahead decoding is a parallel decoding algorithm that accelerates language-model inference by breaking the strict left-to-right dependency of autoregressive generation. It needs no separate draft model and no external data store, which makes it appealing for lean self-hosted setups where running a second model is costly. It is also exact: under greedy decoding the output is identical to what ordinary autoregressive decoding would have produced, and under sampling the output distribution is preserved.
How it works
The method reframes autoregressive decoding as solving a system of nonlinear equations, one per output position, and applies Jacobi (fixed-point) iteration to update all positions in parallel. The key insight is that although a single parallel step rarely gets many consecutive next tokens right, the model can still generate multiple disjoint n-grams in parallel, and some of them will fit later in the sequence. Lookahead maintains a fixed-size 2D window along the sequence and time axes, and splits each step into a lookahead branch that proposes n-grams and a verification branch that confirms them against the model.
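The Jacobi view described above can be illustrated with a minimal sketch. It assumes a hypothetical `toy_next_token` function standing in for a greedy language model; nothing here comes from a real library or from the authors' implementation. Every position is re-predicted from the previous guess, and iteration stops at the fixed point, which coincides with the greedy output.

```python
from typing import List, Tuple


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for a greedy LM: a deterministic function of the prefix."""
    return (sum(prefix) * 31 + len(prefix) * 7) % 50


def greedy_decode(prompt: List[int], m: int) -> List[int]:
    """Ordinary autoregressive decoding: m sequential steps, one token each."""
    seq = list(prompt)
    for _ in range(m):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt: List[int], m: int) -> Tuple[List[int], int]:
    """Solve y_i = f(prompt + y[:i]) for all i by fixed-point iteration."""
    guess = [0] * m  # arbitrary initial guess for all m positions
    iterations = 0
    while True:
        iterations += 1
        # All positions are updated from the *previous* guess. With a real
        # model this would be a single batched forward pass.
        new_guess = [toy_next_token(list(prompt) + guess[:i]) for i in range(m)]
        if new_guess == guess:  # fixed point reached
            return guess, iterations
        guess = new_guess


tokens, iterations = jacobi_decode([1, 2, 3], 8)
print(tokens == greedy_decode([1, 2, 3], 8), iterations)
```

After k iterations the first k tokens are guaranteed correct, so the loop needs at most m + 1 passes (the last one only confirms the fixed point). In this toy the guesses rarely converge early, which mirrors the motivation for collecting and verifying n-grams instead of relying on Jacobi iteration alone.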
Why it matters
Because it generates and verifies n-grams directly with the target model, lookahead decoding trades extra parallel computation for fewer sequential steps, speeding up decoding by roughly 1.5x to 2.3x in reported results. The cost is more floating-point operations per step, so the gains depend on having GPU compute to spare relative to memory bandwidth. The technique was introduced by Fu et al. (LMSYS) and presented at ICML 2024.
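The trade between sequential steps and per-step work can be made concrete with a toy sketch of the verification side. Everything here is an assumption for illustration: `cyclic_next_token` is a hypothetical model, and the n-gram pool is filled from already-generated text as a simplified stand-in for the lookahead branch, which in the actual method harvests n-grams from the Jacobi window.

```python
from typing import Dict, List, Tuple


def cyclic_next_token(prefix: List[int]) -> int:
    """Hypothetical greedy model that cycles 0, 1, 2, 3, 4, 0, ..."""
    return (prefix[-1] + 1) % 5


def verify_ngram(prefix: List[int], candidate: List[int]) -> List[int]:
    """Accept the longest candidate prefix that matches greedy decoding,
    then add the model's own next token (a correction or a bonus token).
    A real implementation scores all positions in one forward pass."""
    accepted: List[int] = []
    for tok in candidate:
        expected = cyclic_next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the model's token
            return accepted
        accepted.append(tok)
    accepted.append(cyclic_next_token(prefix + accepted))
    return accepted


def decode_with_pool(prompt: List[int], m: int, n: int = 3) -> Tuple[List[int], int, int]:
    """Return (tokens, sequential steps, positions scored)."""
    seq = list(prompt)
    pool: Dict[int, List[int]] = {}  # token -> n-gram seen after it (simplified pool)
    steps = 0
    scored = 0
    while len(seq) - len(prompt) < m:
        candidate = pool.get(seq[-1], [])
        seq.extend(verify_ngram(seq, candidate))
        steps += 1
        scored += len(candidate) + 1  # every candidate position plus one extra
        for i in range(len(seq) - n):  # refresh the pool from the text so far
            pool[seq[i]] = seq[i + 1:i + 1 + n]
    return seq[len(prompt):len(prompt) + m], steps, scored


tokens, steps, scored = decode_with_pool([0], 20)
print(steps, scored)
```

Plain greedy decoding of 20 tokens would take 20 sequential steps and score 20 positions. The sketch finishes in fewer steps while scoring at least as many positions, which is the shape of the trade described above; the actual ratio in practice depends on how often candidates are accepted and on spare GPU compute.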
Compare it with the draft-model approach in Speculative Decoding and the extra-heads approach in Medusa Decoding.
