Top-k Sampling

Sovereign AI

Top-k sampling is a decoding strategy that limits a language model's choice of next token to the k highest-probability candidates at each step. At every position, the model produces a probability distribution over its entire vocabulary: tens of thousands of tokens or more, most of them wildly inappropriate but each carrying a sliver of probability. Top-k sets the probability of every token ranked below the k-th candidate to zero, renormalizes the survivors so they sum to one, and samples from that reduced set. By cutting off the long tail of unlikely tokens, it reduces the chance of the model wandering into incoherent text while keeping enough options to stay varied and natural.
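The three steps above (truncate, renormalize, sample) can be sketched in a few lines of plain Python. This is a minimal illustration, not any runtime's actual implementation; the function names and the six-token toy vocabulary with made-up logits are assumptions chosen for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Zero out all but the k most probable entries, then renormalize."""
    # Indices of the k highest-probability tokens (ties keep index order).
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    keep_set = set(keep)
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep_set else 0.0
            for i in range(len(probs))]

def sample_top_k(logits, k, rng=random):
    """Sample one token index from the top-k-filtered distribution."""
    filtered = top_k_filter(softmax(logits), k)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

# Hypothetical 6-token vocabulary; the logits are invented for illustration.
logits = [4.0, 3.5, 2.0, 0.5, -1.0, -3.0]
filtered = top_k_filter(softmax(logits), k=3)
token = sample_top_k(logits, k=3, rng=random.Random(0))
```

With k=3, only the first three tokens keep nonzero probability, so the sampled index can never be 3, 4, or 5 regardless of the random seed.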

Why cut the tail at all

Pure sampling from the full distribution occasionally lands on a genuinely terrible token, and a single bad token compounds, because the model must then continue from a sentence that has already gone off the rails. The opposite extreme, always picking the single most likely token (greedy decoding), produces text that is safe but repetitive and lifeless. Sampling strategies exist to walk the line between those failure modes, and top-k was one of the first practical answers: keep the plausible candidates, discard the noise floor, and let randomness operate only within the sensible set.
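To make the two extremes concrete, the sketch below contrasts greedy decoding with pure sampling on a toy distribution. The logits are invented: two plausible tokens plus fifty individually unlikely ones, standing in for the long tail. The helper names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: always return the single highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pure_sample(logits, rng):
    """Pure sampling: draw from the full, untruncated distribution."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

# Two plausible tokens (indices 0 and 1) and fifty tail tokens (indices 2+).
logits = [3.0, 2.0] + [-2.0] * 50

# Each tail token is individually unlikely, but together they hold
# roughly a fifth of the probability mass in this toy example.
tail_mass = sum(softmax(logits)[2:])

rng = random.Random(0)
draws = [pure_sample(logits, rng) for _ in range(1000)]
tail_hits = sum(1 for t in draws if t >= 2)  # how often the tail was chosen

greedy_token = greedy_pick(logits)  # the same token on every call
```

Greedy decoding never leaves index 0, while pure sampling picks a tail token a noticeable fraction of the time even though no single tail token looks likely on its own.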

How the k parameter behaves

The single parameter k sets how many candidates survive the cut. A small k (5 or 10) makes output focused and conservative; a larger k (40 or more; 40 is a common default in local runtimes) allows more diversity and surprise. The weakness is that k is a fixed count, blind to the shape of the distribution. When the model is highly confident, with one token carrying almost all the probability, a k of 40 still admits 39 poor candidates that sampling can occasionally select. When the model is genuinely uncertain (say, choosing among 100 reasonable continuations of a story), the same k arbitrarily amputates 60 good options. The right cutoff depends on the moment, but k cannot adapt. That rigidity is exactly what top-p (nucleus) sampling addresses: it keeps the smallest set of tokens whose cumulative probability reaches a threshold p, letting the candidate count expand and contract with the model's confidence.
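The contrast between a fixed count and an adaptive cutoff can be checked with the two scenarios from the paragraph above. The distributions here are idealized assumptions (one dominant token versus 100 equally likely ones), and the helper names are hypothetical. Whether the threshold test is "reaches" or "strictly exceeds" varies between implementations; this sketch uses "reaches".

```python
def top_k_count(probs, k):
    """Top-k always keeps k candidates (or all of them if fewer exist)."""
    return min(k, len(probs))

def top_p_count(probs, p):
    """Size of the smallest set of top-ranked tokens whose cumulative
    probability reaches p."""
    total = 0.0
    for n, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return n
    return len(probs)

# Confident model: one token holds 95% of the mass, 99 others share the rest.
confident = [0.95] + [0.05 / 99] * 99

# Uncertain model: 100 continuations, all equally plausible.
uncertain = [0.01] * 100

k_confident = top_k_count(confident, 40)   # 40: the winner plus 39 poor candidates
k_uncertain = top_k_count(uncertain, 40)   # 40: 60 reasonable options discarded

p_confident = top_p_count(confident, 0.9)  # 1: the dominant token alone suffices
p_uncertain = top_p_count(uncertain, 0.9)  # roughly 90 candidates survive
```

Top-k returns 40 in both cases, while top-p shrinks to a single candidate when the model is confident and widens to about 90 when it is not.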

Use in local inference

Top-k is one of the oldest and most widely supported sampling controls, exposed by essentially every local runtime, including llama.cpp and Ollama. In practice it is layered with temperature and top-p, and the order matters: top-k typically prunes to a fixed candidate count first, top-p then trims by cumulative probability, and temperature reshapes what remains. That ordering is not universal. Some libraries, such as Hugging Face Transformers, apply temperature before the truncation steps, which leaves top-k's ranking unchanged but alters which tokens top-p keeps. A common recipe (moderate k, top-p around 0.9, temperature tuned to taste) uses top-k as a cheap hard ceiling on candidates while top-p does the adaptive work. Tuning these together is part of getting good, repeatable behaviour from a model you run yourself: factual and code tasks reward tighter settings, creative writing rewards looser ones. Because the settings live on your machine during inference, you can pin them per task instead of accepting a provider's one-size-fits-all default, a small but real perk of the sovereign stack.
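The layering described above (top-k, then top-p, then temperature) might look like the following sketch. It is written from the description in this entry, not copied from llama.cpp or Ollama, and real runtimes include more samplers and edge-case handling. The function name `sample_next` and the `PRESETS` values are hypothetical illustrations, not recommended settings.

```python
import math
import random

def sample_next(logits, top_k=40, top_p=0.9, temperature=0.8, rng=random):
    """Pick a token index using top-k, then top-p, then temperature.
    Assumes temperature > 0 and top_k >= 1."""
    # 1. Top-k: keep the indices of the k highest logits, best first.
    order = sorted(range(len(logits)), key=lambda i: logits[i],
                   reverse=True)[:top_k]

    # 2. Top-p: over the survivors, keep the smallest prefix whose
    #    cumulative probability reaches top_p.
    m = logits[order[0]]
    exps = [math.exp(logits[i] - m) for i in order]
    total = sum(exps)
    kept, cumulative = [], 0.0
    for i, e in zip(order, exps):
        kept.append(i)
        cumulative += e / total
        if cumulative >= top_p:
            break

    # 3. Temperature: rescale the surviving logits, then sample.
    #    rng.choices accepts unnormalized weights.
    weights = [math.exp((logits[i] - m) / temperature) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical per-task presets; the numbers are illustrative only.
PRESETS = {
    "code":     {"top_k": 20, "top_p": 0.9,  "temperature": 0.2},
    "creative": {"top_k": 80, "top_p": 0.95, "temperature": 1.0},
}

logits = [2.0, 1.0, 0.5, -1.0, -4.0]  # invented scores for a 5-token vocabulary
code_token = sample_next(logits, rng=random.Random(0), **PRESETS["code"])
story_token = sample_next(logits, rng=random.Random(0), **PRESETS["creative"])
```

Keeping presets in a small table like this is one way to pin settings per task. Setting `top_k=1` reduces the whole pipeline to greedy decoding, since only the top-ranked token survives the first step.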

See beam search for the deterministic, search-based alternative used when there is a single best answer to find rather than a voice to vary.
