Definition
Medusa is an inference-acceleration framework that speeds up language-model generation by adding multiple lightweight prediction heads on top of the model's final hidden state. Each head guesses a token several positions into the future, so the model drafts a short tree of candidate continuations in one pass instead of generating strictly one token at a time. For a sovereign operator constrained by local GPU throughput, techniques like Medusa squeeze more tokens per second out of fixed hardware without changing the model's output distribution.
How it works
Where classic speculative decoding needs a separate smaller draft model, Medusa eliminates that by training extra decoding heads on the original model, which is otherwise frozen during training. At inference, each head produces several top predictions for its position; these are assembled into candidate sequences and verified in parallel using a tree-attention mask that lets a token attend only to its predecessors. Accepted tokens advance the sequence; rejected branches are discarded, preserving correctness.
Why it matters
The appeal is simplicity and self-containment: there is no second model to host, version, or keep aligned, only a set of small heads fine-tuned onto the model you already run. This makes it attractive for self-hosted deployments where managing two models is operationally awkward. The original method was introduced in the 2024 paper "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads" by Cai et al.
Medusa is a variant of the broader idea described under Speculative Decoding, and a sibling of the draft-free approach in Lookahead Decoding.
In Simple Terms
Medusa is an inference-acceleration framework that speeds up language-model generation by adding multiple lightweight prediction heads on top of the model’s final hidden state. Each…
