Definition
Best-of-N (BoN) sampling is an inference-time alignment method: instead of returning the first response a model produces, you sample N candidates and use a reward model to pick the highest-scoring one. It is a simple, training-free way to raise output quality by spending more compute at generation time.
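The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `best_of_n`, `generate`, and `reward` are hypothetical names, and the two callables stand in for whatever sampling function and reward model you actually use.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: draws one sampled response
    reward: Callable[[str, str], float],  # hypothetical: scores (prompt, response)
    n: int,
) -> Tuple[str, float]:
    """Sample n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")

    # Draw n independent candidates from the unchanged base model.
    candidates = [generate(prompt) for _ in range(n)]

    # Score every candidate with the reward model.
    scores = [reward(prompt, candidate) for candidate in candidates]

    # Keep the candidate with the highest reward (first one wins on ties).
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]
```

With n = 1 this reduces to ordinary sampling, which is why N acts as a single knob for how much extra compute is spent per answer.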
The quality-versus-divergence tradeoff
Raising N generally improves the selected response, but it also pushes the resulting distribution away from the base model. A commonly cited expression for the KL divergence between the best-of-N policy and the reference policy is log(N) − (N−1)/N, which is an upper bound rather than an exact value and grows roughly logarithmically with N. Push N too high and the method can exploit flaws in the reward model, a failure called reward over-optimization or reward hacking, where outputs score well on the proxy reward but are worse by the true objective.
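To get a feel for how quickly that bound grows, the expression can be evaluated directly. The sketch below assumes the natural logarithm, so the result is in nats; the function name `bon_kl_upper_bound` is made up for illustration.

```python
import math


def bon_kl_upper_bound(n: int) -> float:
    """Evaluate log(N) - (N - 1) / N, the commonly cited KL upper bound (in nats)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log(n) - (n - 1) / n


if __name__ == "__main__":
    # The bound is 0 at N = 1 (no selection) and rises slowly as N grows.
    for n in (1, 2, 4, 16, 64, 256):
        print(f"N={n:>3}  KL bound = {bon_kl_upper_bound(n):.3f} nats")
```

Because the bound grows only logarithmically, doubling N adds a roughly constant amount of divergence, while the sampling cost doubles.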
Inference-time, not training-time
The key trait of best-of-N is that it changes nothing about the weights — it is pure inference-time selection. That makes it a useful baseline and a knob you can turn on any deployed model, trading more sampling for better answers. When the chosen best responses are instead fed back into training, the method becomes rejection sampling fine-tuning.
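The hand-off from inference-time selection to training data can be sketched as follows. This is only an illustration of the data-collection half of rejection sampling fine-tuning, under the assumption that each prompt contributes its single best-scoring response; `collect_best_responses`, `generate`, and `reward` are hypothetical names, and the fine-tuning step itself is left out.

```python
from typing import Callable, Dict, Iterable, List


def collect_best_responses(
    prompts: Iterable[str],
    generate: Callable[[str], str],       # hypothetical: draws one sampled response
    reward: Callable[[str, str], float],  # hypothetical: scores (prompt, response)
    n: int,
) -> List[Dict[str, str]]:
    """Build (prompt, best response) records that a later fine-tuning run could use."""
    if n < 1:
        raise ValueError("n must be at least 1")

    dataset: List[Dict[str, str]] = []
    for prompt in prompts:
        # Same selection as plain best-of-N: sample n, keep the top-scoring one.
        candidates = [generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        # Instead of returning the winner to a user, store it as a training example.
        dataset.append({"prompt": prompt, "response": best})
    return dataset
```

The selection logic is identical in both cases; what differs is whether the winner is shown to a user or written into a dataset that later changes the weights.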
For self-hosters, best-of-N is one of the cheapest alignment levers available: it needs only a reward model and extra GPU cycles, with no retraining. It relies on the same reward signal that is used to build a preference dataset.
