Best-of-N Sampling

Best-of-N (BoN) sampling is an inference-time alignment method: instead of returning the first response a model produces, you sample N candidate responses and use a reward model to pick the highest-scoring one. It is a simple, training-free way to raise output quality by spending more compute at generation time rather than more effort at training time — and its very simplicity is why it remains the baseline every fancier alignment technique gets measured against.

The mechanics

The recipe has three ingredients: a base model that generates, a sampling temperature high enough that the N candidates actually differ, and a reward model — a scorer trained to predict which of two responses a human would prefer. Draw N completions for the same prompt, score each, return the argmax. Nothing about the base model changes; the improvement comes purely from selection pressure. With N = 4 the gain is already visible on most benchmarks; typical deployments use N between 4 and 64, and research setups push into the hundreds to study scaling behavior.

The quality-versus-divergence trade-off

Raising N generally improves the selected response, but it also pushes the resulting output distribution away from the base model. A commonly cited expression bounds the KL divergence between the best-of-N policy and the reference policy from above by log(N) − (N−1)/N (natural log, so the value is in nats), a quantity that grows only logarithmically with N. That slow growth is the good news: BoN buys quality at an unusually gentle divergence cost compared with heavy-handed fine-tuning. The bad news arrives when N gets large relative to the reward model's reliability. Selection pressure amplifies whatever the scorer rewards, including its mistakes: verbosity bias, sycophancy, confident-sounding wrongness. Push N too high and you get reward over-optimization, also called reward hacking: outputs that score beautifully on the proxy while actually getting worse by human judgment. The reward model, not the compute budget, is almost always the binding constraint.
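To get a feel for how slowly the divergence expression quoted above grows, it helps to evaluate it for a few values of N. The helper name `bon_kl_bound` is made up for this sketch, and the natural logarithm is assumed, so the results are in nats.

```python
import math


def bon_kl_bound(n: int) -> float:
    """Evaluate log(N) - (N - 1) / N, in nats, for a best-of-N policy."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log(n) - (n - 1) / n


for n in (1, 4, 16, 64, 256):
    print(f"N={n:>3}  KL bound = {bon_kl_bound(n):.3f} nats")
```

Going from N = 4 to N = 64 costs sixteen times as many samples but raises the bound by only about 2.5 nats (from roughly 0.64 to roughly 3.17). The expression says nothing about reward-model error, which is the limit that usually bites first.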

Inference-time, not training-time

The defining trait of best-of-N is that it changes nothing about the weights — it is pure inference-time selection, a knob you can turn on any deployed model today and turn off tomorrow. That makes it the natural first rung on the test-time-compute ladder. When the selected best responses are instead fed back into training, the method becomes rejection sampling fine-tuning, which bakes the selection pressure into the weights permanently — same idea, different point of application.
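The hand-off from inference-time selection to rejection sampling fine-tuning amounts to saving the winners instead of returning them. The sketch below covers only that data-collection half. `generate` and `score` are again hypothetical stand-ins for a sampler and a reward model, the `prompt`/`completion` record layout is an assumption, and the fine-tuning step itself is left out.

```python
from typing import Callable, Dict, Iterable, List


def build_rft_dataset(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int,
) -> List[Dict[str, str]]:
    """Collect one best-of-n completion per prompt as a fine-tuning record."""
    if n < 1:
        raise ValueError("n must be at least 1")
    dataset: List[Dict[str, str]] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        # Same selection rule as plain best-of-N: keep the top-scoring sample.
        best = max(candidates, key=lambda c: score(prompt, c))
        dataset.append({"prompt": prompt, "completion": best})
    return dataset
```

Whatever bias the scorer has ends up in these records, and after training it can no longer be switched off by lowering N.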

Two refinements are worth knowing by name. Weighted best-of-N replaces the hard argmax with reward-weighted selection, softening the over-optimization pressure at large N. And process-reward variants score intermediate reasoning steps rather than only final answers, which matters for math and code where a wrong path can still stumble into a right-looking result. Both keep the core property intact: the base model's weights are never touched, so nothing about the intervention is permanent or hidden.
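One way to read "reward-weighted selection" is to sample the winner from a softmax over the reward scores instead of taking the hard argmax. That reading is an assumption of this sketch, not a definition taken from a specific paper or library, and the `temperature` parameter and function names are illustrative.

```python
import math
import random
from typing import List, Optional, Sequence


def selection_weights(rewards: Sequence[float], temperature: float) -> List[float]:
    """Softmax over rewards; lower temperature approaches a hard argmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not rewards:
        raise ValueError("rewards must not be empty")
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def soft_best_of_n(
    candidates: Sequence[str],
    rewards: Sequence[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one candidate with probability proportional to exp(reward / T)."""
    if len(candidates) != len(rewards):
        raise ValueError("candidates and rewards must have the same length")
    rng = rng or random.Random()
    weights = selection_weights(rewards, temperature)
    return rng.choices(list(candidates), weights=weights, k=1)[0]
```

A small temperature recovers ordinary best-of-N, while a large one drifts toward picking uniformly among the candidates. The useful range depends on the scale of your reward model's scores.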

For self-hosters, best-of-N is the cheapest alignment lever on the shelf: it needs only a reward model and spare GPU cycles, no retraining pipeline, no preference-data collection of your own. The economics suit local hardware well — sampling N candidates batches efficiently, and overnight or idle-time generation makes the marginal cost of extra samples nearly zero on a machine you already own. A practical local recipe: run BoN with a compact open reward model over your base model's outputs for tasks where quality matters more than latency (drafting, code generation, summaries for the record), keep N modest — single digits to low tens — and treat any suspiciously reward-model-pleasing output as the warning sign it is. The reward signal itself comes from the same machinery described in our preference dataset entry, and the quality ceiling of everything downstream — BoN included — is set right there, at the data.
