Online DPO


Online DPO is a variant of Direct Preference Optimization in which the preference pairs used for training are generated on-policy — sampled from the model as it trains — rather than read from a static, pre-collected dataset. In the online loop, the current policy produces fresh candidate responses to each prompt, a judge labels which response is preferred, and the policy updates immediately on that signal. Generation, labeling, and optimization run as one continuous stream instead of separate phases.
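The loop described above can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not a production recipe: the "policy" is a softmax over four canned responses to a single prompt, the judge is a fixed table of hypothetical quality scores, and every name here (`sample_pair`, `judge`, `dpo_step`, `run_online_dpo`) is invented for the example. The update is the exact gradient of the DPO loss for this toy softmax policy, with the initial logits serving as the frozen reference.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_pair(logits, rng):
    """Sampler: draw two distinct responses from the *current* policy."""
    probs = softmax(logits)
    indices = range(len(logits))
    first = rng.choices(indices, weights=probs)[0]
    # Zero out the first pick so the second response is always different.
    remaining = [0.0 if i == first else p for i, p in enumerate(probs)]
    second = rng.choices(indices, weights=remaining)[0]
    return first, second


def judge(a, b, scores):
    """Judge: order the pair as (chosen, rejected) using hypothetical scores."""
    return (a, b) if scores[a] >= scores[b] else (b, a)


def dpo_step(logits, ref_logits, chosen, rejected, beta=0.5, lr=0.5):
    """Optimizer: one gradient step on the DPO loss for a single pair.

    For a softmax policy the log-normalizer cancels in the pairwise margin,
    so the gradient touches only the chosen and rejected logits.
    """
    margin = (logits[chosen] - logits[rejected]) - (
        ref_logits[chosen] - ref_logits[rejected]
    )
    grad_scale = beta / (1.0 + math.exp(beta * margin))  # beta * sigmoid(-beta * margin)
    logits[chosen] += lr * grad_scale
    logits[rejected] -= lr * grad_scale


def run_online_dpo(steps=200, seed=0):
    rng = random.Random(seed)
    scores = [0.1, 0.4, 0.7, 0.9]  # assumed judge scores for 4 canned responses
    ref_logits = [0.0, 0.0, 0.0, 0.0]  # frozen reference policy (uniform)
    logits = list(ref_logits)  # live policy starts at the reference
    for _ in range(steps):
        a, b = sample_pair(logits, rng)  # generation
        chosen, rejected = judge(a, b, scores)  # labeling
        dpo_step(logits, ref_logits, chosen, rejected)  # optimization
    return softmax(logits)


if __name__ == "__main__":
    print(run_online_dpo())
```

Because each pair is sampled from `logits` as they stand at that step, the judge only ever labels responses the current policy actually produces, which is the defining property of the online variant.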

On-policy versus offline

Standard "offline" DPO learns from a fixed file of chosen and rejected responses that may have been written by an entirely different model. The trained policy never receives feedback on its own current generations, and as it improves it drifts away from the frozen data. The resulting distribution shift weakens the training signal and can steer the model toward responses the dataset never sampled, where no preference label constrains it. Online DPO removes the gap by construction: every comparison is drawn from the live policy, so the preference signal lands where the model currently is, much as reinforcement-learning methods like PPO generate on-policy completions throughout training. Studies comparing the two regimes have generally reported that on-policy preference learning converges faster and reaches higher final quality than static offline tuning, which suggests that data freshness can matter as much as data volume.
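For reference, both regimes minimize the same standard DPO objective; what changes is where the triples come from. The notation below is the conventional one and is not taken from this entry: x is a prompt, y_w and y_l are the preferred and rejected responses, π_θ is the policy being trained, π_ref is the frozen reference model, β is a temperature, and σ is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

% Offline: (x, y_w, y_l) \sim \mathcal{D}, a fixed dataset.
% Online:  y_1, y_2 \sim \pi_\theta(\cdot \mid x), and the judge decides
%          which of the two becomes y_w and which becomes y_l.
```

In the offline case the expectation runs over a dataset that does not depend on θ. In the online case the responses themselves are drawn from π_θ, so the distribution being averaged over moves with the policy.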

What the loop requires

The price is a heavier training rig. An online setup runs three components concurrently: a sampler generating candidate responses from the current policy (at least two per prompt, since each comparison needs a pair), a judge (a trained reward model or an LLM prompted as evaluator) labeling each pair on the fly, and the optimizer applying the DPO update. Inference and training now compete for the same accelerators, checkpoint plumbing must keep the sampler synchronized with the freshest weights, and the judge sits inside the hot loop, so its latency gates throughput. There is also a subtler risk: because the judge is queried continuously on the policy's own outputs, any exploitable quirk in it becomes an optimization target. Reward hacking is not exclusive to RLHF, and judge quality caps what online DPO can achieve. Logging deserves care as well: because data is generated and consumed in flight, reproducing a run later means persisting the sampled pairs and judge verdicts as you go.
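The logging point can be made concrete with a small sketch. It assumes an append-only JSON Lines stream, one record per judged pair. The field names (`step`, `policy_version`, `judge_id`, and so on) and both helper functions are hypothetical choices for illustration, not a format any particular trainer uses.

```python
import io
import json


def log_preference(stream, step, prompt, chosen, rejected, judge_id, policy_version):
    """Append one judged pair to a JSON Lines stream and return the record."""
    record = {
        "step": step,                      # optimizer step that consumed the pair
        "policy_version": policy_version,  # which weights produced the samples
        "judge_id": judge_id,              # which judge issued the verdict
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
    }
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_preferences(stream):
    """Read back every logged pair, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


if __name__ == "__main__":
    # An in-memory buffer stands in for a file opened in append mode.
    buf = io.StringIO()
    log_preference(buf, 0, "Explain DPO.", "A clear answer.", "A vague answer.",
                   judge_id="toy-judge", policy_version="ckpt-0")
    buf.seek(0)
    print(read_preferences(buf))
```

Recording the policy version alongside each pair makes it possible to tell later how stale a sample was when it was trained on, and the resulting log can be replayed as an ordinary offline dataset.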

Online, iterative, and the middle ground

Online DPO is the continuous limit of the idea behind Iterative DPO: where the iterative variant refreshes its dataset in discrete rounds — generate a batch, label it, train, repeat — online DPO folds the refresh into every step. The spectrum is really about staleness tolerance. Discrete rounds are simpler to operate, restartable, and easy to audit between rounds; the streaming loop extracts the most from on-policy freshness at the cost of operational complexity. Many practical pipelines land in between, regenerating data every few hundred steps — mostly-online freshness on an offline-shaped harness.
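One way to picture the spectrum is as a single knob: how many optimizer steps pass between data refreshes. The sketch below is only an illustration; `plan_refreshes`, `max_staleness`, and the `refresh_every` parameter are hypothetical names, and it assumes data is regenerated at the start of each round and reused until the next one.

```python
def plan_refreshes(total_steps, refresh_every):
    """Return the training steps at which preference data is regenerated."""
    if refresh_every < 1:
        raise ValueError("refresh_every must be at least 1")
    return list(range(0, total_steps, refresh_every))


def max_staleness(refresh_every):
    """Most optimizer updates separating a pair from the policy that sampled it."""
    return refresh_every - 1


if __name__ == "__main__":
    for label, every in [("online", 1), ("middle ground", 250), ("one offline round", 1000)]:
        rounds = plan_refreshes(1000, every)
        print(label, "refreshes:", len(rounds), "max staleness:", max_staleness(every))
```

Setting `refresh_every` to 1 gives the fully online loop with zero staleness, a value in the hundreds gives the in-between harness described above, and a value equal to the run length collapses to a single generate-then-train round.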

The self-hosted calculus

For sovereign AI builders aligning models on owned hardware, online DPO buys the freshest preference signal available at the price of the most demanding training loop. On a single-GPU rig, running sampler, judge, and optimizer together is often impractical. The pragmatic choices there are iterative DPO in scheduled rounds, which keeps most of the on-policy benefit while avoiding the concurrency burden, or a simpler reference-free objective like SimPO (Simple Preference Optimization), which cuts memory by dropping the reference model entirely. On a multi-GPU homelab, a modest online loop becomes feasible: policy training on one device, judge inference on another, prompts and preferences never leaving your network. That is the deeper appeal: the entire feedback cycle that shapes the model's behavior, from generation to judgment to update, runs on infrastructure you control, encoding your preferences rather than a vendor's.
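To show why a reference-free objective saves memory, here is a minimal sketch of the SimPO loss for a single pair. It scores each response by its length-normalized log-probability under the policy alone and requires a margin `gamma` between chosen and rejected, so no reference-model log-probabilities appear anywhere. The function name and the default values of `beta` and `gamma` are illustrative assumptions, not recommended settings.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for one pair.

    logp_* are summed token log-probabilities under the policy being trained;
    len_* are response lengths in tokens. Note the absence of any reference
    model term, which is what removes the second set of weights from memory.
    """
    avg_chosen = logp_chosen / len_chosen        # length-normalized reward
    avg_rejected = logp_rejected / len_rejected
    x = beta * (avg_chosen - avg_rejected) - gamma
    # -log(sigmoid(x)), computed stably as softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # The policy already prefers the chosen response, so the loss is small.
    print(simpo_loss(-5.0, 5, -20.0, 10))
    # The policy prefers the rejected response, so the loss is large.
    print(simpo_loss(-20.0, 10, -5.0, 5))
```

The same per-pair function could sit inside either an offline or an iterative harness; the trade-off against DPO is that nothing anchors the policy to a frozen starting point.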
