Rejection Sampling Fine-Tuning


Rejection sampling fine-tuning is an alignment technique in which the model generates several candidate responses for each prompt, a reward model scores them all, and the model is then fine-tuned only on the top-scoring outputs. The lower-scoring candidates are "rejected" and discarded — hence the name. Crucially, the training step uses the same loss as ordinary supervised fine-tuning; what changes is the data, which is now a self-curated set of the model's own best work. It is the simplest serious answer to a central alignment question: how do you distill a reward signal into a model's weights without reinforcement-learning machinery?

The mechanics

For each prompt in a training set, the current model samples K candidate responses, typically with a temperature high enough to get genuine diversity. A trained reward model ranks the candidates, and the single best (or the top few) becomes the new training target for that prompt. Fine-tuning on these winner-only examples nudges the policy's probability mass toward its own best behavior: the model could already produce these responses, and training makes them what it produces by default. Run once, this harvests the low-hanging gains. Run in rounds (regenerate candidates from the improved model, rescore, retrain), it compounds, which is the recipe formalized as RAFT (Reward rAnked Fine-Tuning) and sometimes called iterative best-of-N fine-tuning. Meta's Llama 2 used this pattern: its earlier RLHF iterations relied on rejection sampling fine-tuning alone, and PPO was applied on top of the rejection-sampled checkpoint only in the later iterations. The tunable knobs are K and the sampling temperature: more candidates and more diversity raise the ceiling on what the reward model can find, at a generation cost that grows linearly with K, which makes it a budget decision rather than an algorithmic one.
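The sample, score, select loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `generate` and `score` are hypothetical stand-ins for a sampling call to the policy and a forward pass through a reward model, and the toy versions below exist only so the sketch runs.

```python
import random
from typing import Callable, Dict, List


def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str], str],      # hypothetical: samples one response from the policy
    score: Callable[[str, str], float],  # hypothetical: reward score for (prompt, response)
    k: int = 8,
    keep_top: int = 1,
) -> List[Dict[str, str]]:
    """Sample k candidates per prompt and keep only the top-scoring ones."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
        # Everything below the cut is "rejected" and never reaches training.
        for response in ranked[:keep_top]:
            dataset.append({"prompt": prompt, "response": response})
    return dataset


# Toy stand-ins so the sketch is runnable; a real setup would sample from the
# model at a chosen temperature and score with a trained reward model.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return f"{prompt} -> answer {rng.randint(0, 99)}"


def toy_score(prompt: str, response: str) -> float:
    # Pretend that a larger trailing number means a better response.
    return float(response.rsplit(" ", 1)[-1])


sft_data = build_sft_dataset(["Explain KL divergence"], toy_generate, toy_score, k=4)
```

The resulting list of prompt-response records can be passed to whatever supervised fine-tuning setup is already in use. Running the method in rounds amounts to calling the same function again with the newly trained model behind `generate`.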

Baking inference-time search into the weights

The cleanest way to understand the method: it converts Best-of-N sampling from an inference-time trick into a permanent improvement. Best-of-N gets better answers by generating N candidates and keeping the reward model's favorite. That is effective, but you pay roughly N-fold generation compute on every single query, forever. Rejection sampling fine-tuning pays that search cost once, during training, and amortizes it into the weights: after training, a single sample can approach the quality that previously required the full N-candidate search, though it will not necessarily match it. For anyone serving a model on constrained hardware, shifting cost from every inference to a one-time training run is exactly the right direction.
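A rough back-of-the-envelope calculation makes the amortization concrete. The sketch below is a simplification built on assumptions that are not in the text above: costs are counted in units of one generated response, reward-model scoring is ignored, and the fine-tuned model is assumed to fully replace Best-of-N at serving time. The function name and the numbers are illustrative only.

```python
def break_even_queries(num_prompts: int, k: int, n: int, finetune_cost: float = 0.0) -> float:
    """Number of served queries after which one-time training beats Best-of-N.

    Training spends num_prompts * k generations plus finetune_cost.
    Best-of-N serving spends n generations per query, versus 1 for the
    fine-tuned model, so each query saves n - 1 generations.
    """
    if n < 2:
        raise ValueError("Best-of-N needs n >= 2 to have any serving overhead")
    training_cost = num_prompts * k + finetune_cost
    saved_per_query = n - 1
    return training_cost / saved_per_query


# Illustrative numbers only: 10,000 prompts with K = 8 candidates each,
# compared against serving with Best-of-8.
print(break_even_queries(num_prompts=10_000, k=8, n=8))
```

Past the break-even point every additional query is cheaper than it would have been under Best-of-N, which is why the trade tends to favor deployments with steady traffic and limited inference hardware.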

Strengths and limits

Because it reuses the standard supervised loss, the method is dramatically simpler and more stable than PPO-style reinforcement learning (no value networks, no rollout buffers, no delicate KL schedules) while still distilling the reward model's judgment into the policy. Its limits are equally clear. It only reinforces what the model can already sample: if none of the K candidates exhibits a behavior, no amount of selection teaches it. It learns from positives only, discarding the information in bad responses that contrastive methods exploit. And it inherits the reward model's blind spots: over-optimizing a flawed judge produces responses that score well without actually being better. In mature pipelines it therefore often serves as the robust first stage, with a preference-based method sharpening the result afterward.
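The first limit can be put in numbers. As a simplified model, and an assumption of this sketch rather than something stated above, suppose each sample independently exhibits the desired behavior with probability `p`. The chance that at least one of K candidates shows it is then 1 - (1 - p)^K, which is exactly zero when p is zero no matter how large K gets.

```python
def coverage_probability(p: float, k: int) -> float:
    """Probability that at least one of k independent samples shows the behavior."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return 1.0 - (1.0 - p) ** k


# A behavior the model produces 1% of the time becomes findable as K grows;
# a behavior it never produces stays unfindable at any K.
for k in (4, 16, 64):
    print(k, round(coverage_probability(0.01, k), 3), coverage_probability(0.0, k))
```

Raising K or the sampling temperature helps with rare behaviors, but selection alone cannot introduce a capability the policy assigns no probability to.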

Why it suits the sovereign stack

For a self-hoster, this is arguably the most accessible alignment loop that actually works. Every ingredient runs on owned hardware with commodity tooling: batch generation is ordinary inference you can schedule overnight, scoring is a forward pass through a reward model (or a strong local LLM prompted as judge), and training is plain fine-tuning — QLoRA-friendly, single-GPU-friendly. No RL framework, no annotation vendor, no prompts leaving your network. The rejected-but-scored candidates need not go to waste either: chosen-rejected pairs harvested along the way seed a preference dataset for later direct preference tuning. Sample, score, keep the best, train on it — a craftsman's loop for raising a model's quality on hardware you control.
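Harvesting preference pairs from the already-scored candidates might look like the sketch below. It is one possible approach, not a prescribed one: pairing the best against the worst candidate, the `min_margin` filter, and the `prompt`/`chosen`/`rejected` field names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def harvest_preference_pairs(
    scored: Dict[str, List[Tuple[str, float]]],  # prompt -> [(response, reward score), ...]
    min_margin: float = 0.0,
) -> List[Dict[str, str]]:
    """Pair the best and worst candidate for each prompt as chosen/rejected."""
    pairs = []
    for prompt, candidates in scored.items():
        if len(candidates) < 2:
            continue  # need at least two candidates to form a pair
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
        # Skip near-ties: a tiny score gap is weak evidence of a real preference.
        if best_score - worst_score > min_margin:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs


example = {
    "Summarize the report": [("Draft A", 0.2), ("Draft B", 0.9), ("Draft C", 0.5)],
}
pairs = harvest_preference_pairs(example, min_margin=0.1)
```

Because the scores already exist from the selection step, building this dataset costs no extra generation or scoring compute.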

