Definition
Online DPO is a variant of Direct Preference Optimization in which the preference pairs used for training are generated on-policy — sampled from the model as it trains — instead of being read from a static, pre-collected dataset. A reward model or judge then labels which fresh response is preferred, and the policy updates on that signal continuously.
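The loop described above (sample from the current policy, have a judge label the pair, update) can be sketched in a few lines. This is a toy illustration under strong assumptions, not a production recipe: the "policy" is a softmax over a handful of fixed candidate responses to one prompt, the judge is a hypothetical table of scores (`judge_scores`), and the names `online_dpo`, `beta`, and `lr` are invented for this example.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_dpo(judge_scores, steps=500, beta=0.1, lr=1.0, seed=0):
    """Toy online DPO over a fixed set of candidate responses.

    judge_scores[i] is a hypothetical judge's score for response i.
    Returns the final policy as a probability distribution.
    """
    rng = random.Random(seed)
    n = len(judge_scores)
    policy = [0.0] * n        # trainable logits, one per response
    reference = list(policy)  # frozen copy used as the DPO reference
    for _ in range(steps):
        # 1. Sample a fresh pair from the *current* policy (on-policy).
        probs = softmax(policy)
        a, b = rng.choices(range(n), weights=probs, k=2)
        # 2. The judge labels the pair; identical or tied picks carry no signal.
        if judge_scores[a] == judge_scores[b]:
            continue
        win, lose = (a, b) if judge_scores[a] > judge_scores[b] else (b, a)
        # 3. DPO update. For a softmax policy the log-probability gap between
        #    two responses equals the gap between their logits.
        margin = beta * ((policy[win] - policy[lose])
                         - (reference[win] - reference[lose]))
        grad_scale = beta * sigmoid(-margin)  # magnitude of d(loss)/d(logit)
        policy[win] += lr * grad_scale
        policy[lose] -= lr * grad_scale
    return softmax(policy)
```

Calling `online_dpo([0.1, 0.5, 0.9])` should return a distribution that has shifted toward the response the judge scores highest. In a real setup the candidates would be full generations from a language model and the judge would be a reward model or LLM grader, both of which are outside the scope of this sketch.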
On-policy versus offline
Standard "offline" DPO learns from a fixed file of chosen and rejected responses that may have been written by an entirely different model. Because the policy never receives feedback on its own current generations, the training data drifts away from what the model actually produces as its weights change, a form of distribution shift. Online DPO closes that gap by always drawing comparisons from the live policy, much as reinforcement-learning methods such as PPO generate on-policy completions during training. Published comparisons often report that on-policy preference learning converges faster and reaches higher quality than static offline tuning, though the size of the gain depends on the task and on the quality of the reward signal.
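The difference can be made concrete with the standard DPO objective, which both variants share. The notation here is assumed rather than taken from this entry: \pi_\theta is the policy being trained, \pi_{\text{ref}} is a frozen reference model, \beta is a scaling coefficient, \sigma is the logistic function, and y_w and y_l are the preferred and rejected responses to prompt x.

```latex
\mathcal{L}_{\text{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \right)
    \right]
```

In the offline setting the triples (x, y_w, y_l) are drawn from a fixed dataset. In the online setting two responses are sampled from \pi_\theta(\cdot \mid x) at the current step and a judge decides which is y_w and which is y_l. The loss is unchanged; only the source of the expectation differs.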
Online, iterative, and the tradeoff
Online DPO is the continuous form of the idea behind Iterative DPO: where iterative DPO refreshes data in discrete rounds, online DPO folds generation, labeling, and updating into a streaming loop. The cost is added complexity — you must run sampling and a reward signal during training rather than just reading a file.
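One way to picture the offline, iterative, and online spectrum is to track how stale the training pairs are at each update, meaning how many policy updates have happened since the pairs were sampled. The helper below is a hypothetical illustration (`data_staleness` and `refresh_every` are names made up here), and it idealizes the offline case by pretending the static data came from the initial policy, whereas in practice it may come from a different model entirely.

```python
def data_staleness(total_steps, refresh_every=None):
    """For each update step, return how many policy updates have happened
    since the training pairs used at that step were sampled.

    refresh_every=None  -> data sampled once, never refreshed (offline)
    refresh_every=k     -> data re-sampled every k updates (iterative)
    refresh_every=1     -> data re-sampled before every update (online)
    """
    staleness = []
    for step in range(total_steps):
        if refresh_every is None:
            last_refresh = 0
        else:
            last_refresh = (step // refresh_every) * refresh_every
        staleness.append(step - last_refresh)
    return staleness

offline = data_staleness(6)                     # [0, 1, 2, 3, 4, 5]
iterative = data_staleness(6, refresh_every=3)  # [0, 1, 2, 0, 1, 2]
online = data_staleness(6, refresh_every=1)     # [0, 0, 0, 0, 0, 0]
```

The shorter the refresh interval, the closer the data stays to the live policy, and the more often sampling and labeling have to run inside the training loop.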
For self-hosted alignment, online DPO buys training data that stays matched to the current policy at the price of a heavier training loop, a tradeoff worth weighing against simpler offline options such as the reference-free SimPO (Simple Preference Optimization).
