Definition
Beam width is the central parameter of beam search, a decoding strategy used to generate text from a language model. Where greedy decoding commits to the single highest-probability token at every step, beam search keeps the top k partial sequences alive at once. That number k is the beam width. At each step the server extends every surviving sequence with each possible next token, scores the resulting candidates by cumulative probability (in practice, the sum of token log-probabilities), and prunes back down to the best k.
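The expand, score, and prune loop described above can be sketched in a few lines of Python. This is a minimal illustration and not any particular server's implementation: `beam_search`, `toy_model`, and the `TOY` probability table are hypothetical names and made-up numbers chosen so that greedy decoding and a wider beam disagree.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[float, List[str]]  # (cumulative log-probability, token sequence)

# Hypothetical toy distribution: next-token probabilities keyed by the last token.
TOY: Dict[str, Dict[str, float]] = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"z": 0.9, "w": 0.1},
}


def toy_model(seq: List[str]) -> Dict[str, float]:
    """Stand-in for a language model: looks only at the last token."""
    return TOY.get(seq[-1], {})


def beam_search(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    start: str,
    beam_width: int,
    steps: int,
) -> List[Beam]:
    beams: List[Beam] = [(0.0, [start])]
    for _ in range(steps):
        candidates: List[Beam] = []
        # Expand every surviving sequence by every possible next token.
        for score, seq in beams:
            for token, p in next_token_probs(seq).items():
                if p > 0.0:
                    candidates.append((score + math.log(p), seq + [token]))
        # Rank by cumulative log-probability and prune back to the best k.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search(toy_model, "<s>", beam_width=1, steps=2)
wide = beam_search(toy_model, "<s>", beam_width=2, steps=2)
print(greedy[0][1], round(math.exp(greedy[0][0]), 2))
print(wide[0][1], round(math.exp(wide[0][0]), 2))
```

With these toy numbers, width 1 commits to "A" first and ends on a sequence with probability 0.33, while width 2 keeps "B" alive long enough to find the sequence with probability 0.36.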
The quality-versus-cost trade-off
A wider beam explores more of the search space and is more likely to find a high-probability overall sequence, which can improve quality on tasks with a clear correct answer such as translation or constrained generation. But the cost grows roughly in proportion to k: the server must run, store, and rank k sequences in parallel, which multiplies both the memory needed for the key-value cache and the compute per step. A beam width of 1 reduces beam search to greedy decoding; very large widths rarely pay off and can even worsen open-ended generation by favouring bland, high-frequency text.
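To make the memory side of that trade-off concrete, the following back-of-the-envelope estimate may help. It assumes each beam holds its own full key-value cache (some servers share the prompt portion across beams, which lowers the real figure), and the model dimensions are hypothetical values picked for illustration, not those of a specific model.

```python
def kv_cache_bytes(
    beam_width: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: fp16/bf16 cache entries
) -> int:
    """Rough KV-cache size, assuming no cache sharing between beams."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * beam_width


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
for k in (1, 4, 8):
    size = kv_cache_bytes(k, seq_len=4096, num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"beam width {k}: {size / 2**30:.1f} GiB")
```

Under these assumptions a single 4096-token sequence needs 0.5 GiB of cache, so width 4 needs 2 GiB and width 8 needs 4 GiB for one request.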
Beam width in speculative decoding
The term also appears in speculative decoding, where it refers to the number of draft token sequences proposed for verification. A larger speculative beam raises the chance that the longest acceptable prefix is among the drafts, letting the model accept more tokens per step and finish in fewer steps.
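The effect of proposing more drafts can be illustrated with a simplified sketch. It assumes greedy verification, where a draft token is accepted only if it matches the target model's own top choice at that position; real systems typically verify all drafts in one batched forward pass and may use probabilistic acceptance instead. The names `accepted_length`, `best_accepted`, `target_next`, and the `TARGET` sentence are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical "ground truth": what the target model would emit greedily.
TARGET: List[str] = ["the", "cat", "sat", "on", "the", "mat"]


def target_next(context: Sequence[str]) -> Optional[str]:
    """Stand-in for the target model's top next token given the context."""
    return TARGET[len(context)] if len(context) < len(TARGET) else None


def accepted_length(
    draft: Sequence[str],
    next_token: Callable[[Sequence[str]], Optional[str]],
    prefix: Sequence[str] = (),
) -> int:
    """Count leading draft tokens that match the target model's choices."""
    context = list(prefix)
    accepted = 0
    for token in draft:
        if next_token(context) != token:
            break
        context.append(token)
        accepted += 1
    return accepted


def best_accepted(
    drafts: Sequence[Sequence[str]],
    next_token: Callable[[Sequence[str]], Optional[str]],
) -> int:
    """Longest accepted prefix across all proposed drafts."""
    return max((accepted_length(d, next_token) for d in drafts), default=0)


drafts = [["the", "dog", "ran"], ["the", "cat", "sat", "in"]]
print(best_accepted(drafts[:1], target_next))  # speculative width 1
print(best_accepted(drafts, target_next))      # speculative width 2
```

In this toy case the single draft yields one accepted token, while adding the second draft raises that to three, which is the mechanism by which a wider speculative beam can reduce the number of decoding steps.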
For a self-hoster, beam width is a tuning knob: most chat and assistant workloads run fine at width 1, and wider beams are best reserved for accuracy-critical tasks where the extra GPU cost is justified. See throughput-optimized serving for how decoding choices affect overall capacity.
