Beam Width

Beam width is the central parameter of beam search, a decoding strategy for generating text from a language model. Where greedy decoding commits to the single highest-probability token at every step, beam search keeps the top k partial sequences alive at once, and that number k is the beam width. At each step the model expands every surviving sequence by its possible next tokens, scores the resulting candidates by cumulative probability (in practice, the sum of token log-probabilities), and prunes the field back down to the best k. It is a structured compromise between greedy's tunnel vision and the impossibility of searching every branch.
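The expand, score, and prune loop can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: `TOY_MODEL` is a made-up lookup table standing in for a real language model (it conditions only on the last token), and `beam_search`, `bos`, and `eos` are names chosen here for the example.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.4, "a": 0.3, "b": 0.3},
    "b": {"</s>": 0.9, "a": 0.05, "b": 0.05},
}

def beam_search(model, beam_width, max_steps=10, bos="<s>", eos="</s>"):
    """Return up to beam_width (log_prob, tokens) pairs, best first."""
    beams = [(0.0, [bos])]
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:
                # Finished sequences are carried forward unchanged.
                candidates.append((score, tokens))
                continue
            # Expand the sequence by every possible next token.
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Prune back down to the best k candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == eos for _, tokens in beams):
            break
    return beams

greedy = beam_search(TOY_MODEL, beam_width=1)[0]
wide = beam_search(TOY_MODEL, beam_width=2)[0]
print(greedy[1], round(math.exp(greedy[0]), 2))  # ['<s>', 'a', '</s>'] 0.24
print(wide[1], round(math.exp(wide[0]), 2))      # ['<s>', 'b', '</s>'] 0.36
```

In this toy table, width 1 commits to "a" because it is the best first token, while width 2 keeps "b" alive long enough to discover that it leads to a more probable complete sequence.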

The quality-versus-cost trade-off

A wider beam explores more of the search space and is more likely to find a sequence with high overall probability, which can improve quality on tasks that have a clear correct answer, such as machine translation, speech transcription, and other constrained generation. But the cost scales roughly linearly with k: the server must run, store, and rank k sequences in parallel, multiplying the memory used by the key-value cache and the compute spent per step. A beam width of 1 collapses beam search back into greedy decoding. Crucially, very large widths rarely pay off and can even make open-ended text worse, because maximizing probability tends to favour bland, generic, high-frequency phrasing, which is the well-documented reason creative generation usually reaches for sampling instead of a wide beam. Beam search also needs a length penalty to stay useful: every added token multiplies in a probability no greater than 1, so raw cumulative probability keeps falling as a sequence grows, and an unpenalized beam drifts toward truncated, unnaturally short outputs.
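The length bias is easy to see with a small worked example. The sketch below assumes one common form of length penalty, dividing the summed log-probability by the sequence length raised to a power `alpha`; the function name, the `alpha` parameter, and the token probabilities are illustrative choices, and real decoders may use a different formula.

```python
import math

def normalized_score(token_probs, alpha=1.0):
    """Summed log-probability divided by length**alpha (alpha=0 means no penalty)."""
    total = sum(math.log(p) for p in token_probs)
    return total / (len(token_probs) ** alpha)

short = [0.5, 0.5]                   # 2 mediocre tokens, product 0.25
longer = [0.7, 0.7, 0.7, 0.7, 0.7]   # 5 better tokens, product about 0.17

# Without a penalty, the short hypothesis wins purely because it is short.
print(normalized_score(short, alpha=0.0) > normalized_score(longer, alpha=0.0))  # True

# With per-token normalization, the longer and more confident hypothesis wins.
print(normalized_score(longer, alpha=1.0) > normalized_score(short, alpha=1.0))  # True
```

Values of `alpha` between 0 and 1 interpolate between the two behaviours, which is why it is usually treated as a tuning parameter alongside the beam width.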

Beam width in speculative decoding

The term also appears in speculative decoding, a speed technique where a small, fast draft model proposes tokens that the large model then verifies in parallel. Here beam width refers to the number of draft sequences proposed for verification at once. A larger speculative beam raises the odds that a long, acceptable prefix appears among the drafts, letting the big model accept more tokens per verification step and finish a response in fewer expensive forward passes. It is the same core idea — keep several candidate futures alive — applied to throughput rather than to output quality.
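Why a wider speculative beam helps can be illustrated with a deliberately simplified model of verification. The sketch assumes greedy verification, where a draft token is accepted only if it matches the token the large model would have produced; real systems often use probabilistic acceptance rules and verify drafts as a tree, so treat the function names and token lists here as hypothetical.

```python
def accepted_prefix_len(draft, target):
    """Number of leading draft tokens that match the large model's tokens."""
    n = 0
    for draft_token, target_token in zip(draft, target):
        if draft_token != target_token:
            break
        n += 1
    return n

def tokens_accepted(drafts, target, beam_width):
    """Best accepted prefix length among the first beam_width drafts."""
    return max(accepted_prefix_len(d, target) for d in drafts[:beam_width])

# What the large model would emit (assumed known here for illustration).
target = ["the", "cat", "sat", "on"]
# Draft sequences from the small model, best-ranked first.
drafts = [
    ["the", "dog", "ran", "to"],
    ["the", "cat", "sat", "in"],
    ["a", "cat", "sat", "on"],
]

for k in (1, 2, 3):
    print(k, tokens_accepted(drafts, target, k))
# 1 1
# 2 3
# 3 3
```

Adding a draft can never reduce the best accepted prefix, but as the third row shows, it does not always increase it either, while every extra draft adds verification work.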

Tuning it on your own hardware

For a self-hoster, beam width is a practical knob with a clear default. Most chat and assistant workloads run perfectly well at width 1 (greedy) or with sampling, so wider beams are best reserved for accuracy-critical, well-defined tasks where the extra GPU memory and latency are justified. On your own machine you pay that cost directly, in VRAM and in tokens per second. The wider the beam, the fewer concurrent requests a given card can serve, so on constrained hardware the honest choice is usually to keep the beam narrow and spend the saved memory on a larger context window or more parallelism. Decoding strategy also interacts with how a model was trained to behave; see reward model and RLHF for why a well-aligned model often needs little search to produce a good first answer.
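A back-of-the-envelope estimate makes the VRAM cost concrete. The sketch below assumes a hypothetical model shape (32 layers, 8 key-value heads, head dimension 128, 16-bit cache values) and a hypothetical 8 GiB of VRAM left over after loading weights. It also assumes each beam keeps its own full cache; serving stacks that share the common prompt prefix across beams can use less, so read the result as an upper bound.

```python
def kv_cache_bytes(seq_len, beam_width, layers=32, kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate key-value cache size for one request, in bytes."""
    # Factor of 2 covers both the key and the value tensors.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * beam_width

GIB = 1024 ** 3
budget = 8 * GIB  # hypothetical VRAM available for the cache

for k in (1, 2, 4):
    per_request = kv_cache_bytes(seq_len=4096, beam_width=k)
    print(k, per_request / GIB, budget // per_request)
# 1 0.5 16
# 2 1.0 8
# 4 2.0 4
```

Under these assumptions, going from width 1 to width 4 cuts the number of 4096-token requests that fit in the cache budget from 16 to 4.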

Why the default is usually narrow

It surprises newcomers that state-of-the-art chat systems rarely use wide beam search at all. Large, well-tuned models place high probability on strong first choices, so the marginal sequence found by a wider beam is often no better — and sometimes duller — than the greedy or sampled one, at several times the cost. Beam width is therefore best understood as a task-specific tool rather than a global quality dial: turn it up when there is a single right answer worth searching for, and leave it at one when you are chatting, brainstorming, or serving many users on a single GPU. Matching the decoding strategy to the workload is part of running an efficient sovereign inference stack.
