Definition
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a speculative-decoding technique that speeds up large language model inference without changing the model's output distribution. Standard autoregressive decoding produces one token per forward pass. Each step is memory-bound, because the full set of weights must be read from memory to emit a single token, so the GPU's compute sits largely idle. EAGLE accelerates this by drafting several candidate tokens cheaply and verifying them all in a single pass of the full model. It accepts only the drafted prefix that the full model agrees with, so the result is the same as what the model would have produced on its own.
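The draft-then-verify loop can be sketched as follows for greedy decoding. This is a minimal, generic illustration of speculative decoding, not EAGLE's actual implementation. The names `speculative_step`, `generate`, `draft_next`, and `target_next` are hypothetical, and the two models are stand-in functions that return their greedy next token. A real system scores all drafted positions in one batched forward pass of the target model. Here the target is called once per position to keep the sketch dependency-free.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a context to its greedy next token.
NextFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int) -> List[Token]:
    """One draft-then-verify round under greedy decoding; returns the new tokens."""
    # 1. Draft k tokens cheaply, each conditioned on the previous drafts.
    drafts: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafts.append(tok)
        ctx.append(tok)

    # 2. Verify. A real system checks all positions in ONE target forward pass;
    #    this sketch calls target_next per position instead.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafts:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: keep the target's own token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # Every draft matched: the same verification pass yields one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[Token], draft_next: NextFn, target_next: NextFn,
             k: int, max_new: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prompt) + max_new]


# Usage with toy models: the target counts up mod 10; the draft errs after a 5.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10 if ctx[-1] != 5 else 0
print(generate([3], draft, target, k=4, max_new=12))
```

Because every emitted token is either a draft the target agreed with or the target's own correction, the output equals plain greedy decoding of the target. The draft model only affects how many tokens each round yields.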
How EAGLE drafts
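The key insight from the original 2024 paper is that a model's second-to-top-layer feature vectors (the features just before the output layer) evolve more regularly than raw tokens, so they are easier to predict. EAGLE runs a small draft head autoregressively at this feature level. Alongside each feature it also feeds in the token sequence advanced by one step, which resolves the uncertainty about which token was actually sampled. Because the draft head reuses the target model's own internal features rather than running a separate full draft model, it is lightweight. The speedup is lossless for a different reason: every drafted token is still verified by the full model before it is accepted.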
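Feature-level drafting can be pictured with the heavily simplified sketch below. It is written under stated assumptions and is not the published architecture. `ToyDraftHead` and its `fc` weights are hypothetical. A single linear layer stands in for the real head's layers, and the weights are random rather than trained, so the drafted tokens are meaningless. What the sketch does show is the data flow: the head combines the current second-to-top-layer feature with the embedding of the next token, extrapolates the next feature, and reads a token from it through the target model's own (frozen) LM head.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def argmax(v: Vector) -> int:
    return max(range(len(v)), key=v.__getitem__)


class ToyDraftHead:
    """Hypothetical, simplified EAGLE-style draft head (illustration only)."""

    def __init__(self, hidden: int, embed: Matrix, lm_head: Matrix, seed: int = 0):
        rng = random.Random(seed)
        self.embed = embed      # token embeddings, shared with the target model
        self.lm_head = lm_head  # output layer, shared with the target model
        # Assumed trainable part: concat(feature, embedding) [2*hidden] -> feature [hidden].
        self.fc = [[rng.gauss(0.0, 0.1) for _ in range(2 * hidden)]
                   for _ in range(hidden)]

    def draft(self, feature: Vector, next_token: int, k: int) -> List[int]:
        """Draft k tokens starting from the target's last feature and sampled token."""
        tokens: List[int] = []
        f, tok = feature, next_token
        for _ in range(k):
            # List "+" concatenates: the head sees the feature AND the token one step ahead.
            f = matvec(self.fc, f + self.embed[tok])
            # Reuse the target's LM head to turn the predicted feature into a token.
            tok = argmax(matvec(self.lm_head, f))
            tokens.append(tok)
        return tokens
```

In a trained system the feature passed to `draft` would come from the target model's forward pass, and only the head's own weights would be learned.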
EAGLE-2 and EAGLE-3
EAGLE-2 introduced dynamic draft trees that adapt the speculation pattern to context, raising acceptance rates. EAGLE-3 abandoned feature prediction for direct token prediction, fused hidden states from the low, middle, and high layers of the target model, and added a technique called training-time test, which simulates inference conditions during training. The EAGLE-3 authors report speedups of up to roughly 6x over vanilla decoding, with output identical to that of the target model.
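The idea of a context-dependent draft tree can be sketched as follows. This is loosely modeled on the expand-then-rerank scheme described for EAGLE-2, not its actual code. `build_dynamic_tree` and `draft_probs` are hypothetical names, and the draft distribution is a stand-in function. Each node's value is the product of draft confidences along its path, used as a proxy for the chance that the whole path is accepted. Only the most promising nodes at each depth are expanded, and the highest-value nodes overall are kept for verification, so the tree's shape changes with context instead of being fixed in advance.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[int, ...]
# Hypothetical stand-in: draft model's candidate tokens and confidences after a path.
DraftFn = Callable[[Path], Dict[int, float]]


def build_dynamic_tree(draft_probs: DraftFn, depth: int,
                       top_k: int, budget: int) -> List[Path]:
    """Return up to `budget` drafted paths (tree nodes) to verify in one pass."""
    values: Dict[Path, float] = {}
    frontier: List[Tuple[Path, float]] = [((), 1.0)]
    for _ in range(depth):
        children = [(path + (tok,), v * p)
                    for path, v in frontier
                    for tok, p in draft_probs(path).items()]
        for path, v in children:
            values[path] = v
        # Expand phase: only the top_k highest-value nodes at this depth grow further.
        frontier = heapq.nlargest(top_k, children, key=lambda c: c[1])
    # Rerank phase: keep the highest-value nodes overall. A child's value never
    # exceeds its parent's, and ties prefer shorter paths, so the kept set stays
    # a connected tree rooted at the current context.
    chosen = heapq.nlargest(budget, values.items(),
                            key=lambda kv: (kv[1], -len(kv[0])))
    return sorted(path for path, _ in chosen)


# Usage: a context-independent toy draft distribution.
probs = lambda path: {0: 0.6, 1: 0.3, 2: 0.1}
print(build_dynamic_tree(probs, depth=2, top_k=2, budget=3))
```

With a real draft head, `draft_probs` would depend on the path, so confident contexts grow deep, narrow trees and uncertain ones grow shallow, wide trees under the same verification budget.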
For a self-hoster running models on owned hardware, EAGLE is one of the higher-impact ways to cut response time on a single GPU. Compare it with the simpler, model-free approach in our N-gram speculation entry, and see throughput-optimized serving for how speculative methods interact with batching.
