Definition
Iterative DPO is the practice of running Direct Preference Optimization not once over a fixed dataset, but in repeated rounds, where fresh preference pairs are sampled from the most recent version of the model before each round. This keeps the training data close to what the model currently produces, addressing a core weakness of one-shot preference tuning.
Closing the distribution gap
Plain DPO trains on a static dataset whose responses were often written by some other model. As the policy improves, its own outputs drift away from that frozen data. This distribution shift means the preference labels no longer describe the text the model actually produces, which can lead the model to favor out-of-distribution responses. Iterative DPO refreshes the reference model and re-samples preference pairs from the current policy at the start of each round, so the model always learns from feedback on text it would actually generate.
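The round structure described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the `sample`, `score`, and `dpo_train` callables, the `PreferencePair` type, and the default round and sample counts are all hypothetical placeholders that a real pipeline would supply (for example, a generation call, a reward model or judge, and a DPO training step).

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def iterative_dpo(
    policy: Any,
    prompts: List[str],
    sample: Callable[[Any, str, int], List[str]],  # hypothetical: draw n responses
    score: Callable[[str, str], float],            # hypothetical: rate one response
    dpo_train: Callable[[Any, Any, List[PreferencePair]], Any],  # hypothetical: one DPO pass
    rounds: int = 3,
    samples_per_prompt: int = 4,
) -> Tuple[Any, List[List[PreferencePair]]]:
    """Run several DPO rounds, re-sampling pairs from the current policy each time."""
    history: List[List[PreferencePair]] = []
    for _ in range(rounds):
        # Refresh the reference: the policy as it stands at the start of the round.
        # (Assumption: a real setup would keep a frozen copy of the weights here.)
        reference = policy

        pairs: List[PreferencePair] = []
        for prompt in prompts:
            # On-policy sampling: candidates come from the current model.
            candidates = sample(policy, prompt, samples_per_prompt)
            scored = [(score(prompt, r), r) for r in candidates]
            best = max(scored, key=lambda t: t[0])
            worst = min(scored, key=lambda t: t[0])
            # Keep the pair only if the scores actually separate the responses.
            if best[0] > worst[0]:
                pairs.append(PreferencePair(prompt, best[1], worst[1]))

        # One round of DPO on the freshly sampled pairs.
        policy = dpo_train(policy, reference, pairs)
        history.append(pairs)
    return policy, history
```

Passing the three steps in as callables keeps the sketch independent of any particular training library. The point it is meant to show is the ordering: the reference is refreshed, pairs are sampled from the current policy, and only then does training run.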
Self-rewarding loops
A notable use is the Self-Rewarding Language Models framework, where the model both generates candidate responses and judges them via LLM-as-a-judge prompting, then trains on those self-labeled pairs through iterative DPO. This reduces dependence on external human annotation, letting an aligned model bootstrap further alignment from its own outputs.
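As a rough sketch of how self-labeling might look, the Python below has a single `generate` callable play both roles: it first produces candidate responses, then scores each one through a judge prompt. The prompt wording, the 0 to 5 scale, the `Score:` output convention, and the names `JUDGE_TEMPLATE`, `parse_score`, and `self_label` are illustrative assumptions, not the exact setup from the Self-Rewarding Language Models work.

```python
import re
from typing import Callable, List, Optional, Tuple

# Illustrative judge prompt (an assumption, not a published template).
JUDGE_TEMPLATE = (
    "Rate the response to the instruction on a scale of 0 to 5.\n"
    "Instruction: {prompt}\n"
    "Response: {response}\n"
    "End your answer with 'Score: <number>'."
)


def parse_score(judgement: str) -> Optional[float]:
    """Extract the numeric score from the judge's text, or None if absent."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judgement)
    return float(match.group(1)) if match else None


def self_label(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical: one model call, text in, text out
    n_candidates: int = 4,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) labeled by the model itself, or None if unusable."""
    # Step 1: the model acts as generator.
    candidates = [generate(prompt) for _ in range(n_candidates)]

    # Step 2: the same model acts as judge.
    scored: List[Tuple[float, str]] = []
    for response in candidates:
        judgement = generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
        value = parse_score(judgement)
        if value is not None:  # skip judgements that could not be parsed
            scored.append((value, response))

    if len(scored) < 2:
        return None
    best = max(scored, key=lambda t: t[0])
    worst = min(scored, key=lambda t: t[0])
    if best[0] == worst[0]:
        return None  # no preference signal when all scores tie
    return best[1], worst[1]
```

Dropping unparseable judgements and tied scores is a design choice in this sketch: a pair with no score gap carries no preference information, so it is filtered out rather than passed to training.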
For sovereign AI builders, iterative DPO is a path to continual self-improvement on owned hardware without an external annotation vendor in the loop. It is closely related to Online DPO and depends on a well-formed preference dataset at each round.
