Definition
Direct Preference Optimization (DPO) is a 2023 method from Rafailov and colleagues at Stanford that aligns language models to human preferences without the moving parts of full RLHF. Its paper carries the memorable subtitle "Your Language Model Is Secretly a Reward Model," capturing the core insight that the model being trained already encodes a reward signal implicitly.
How it differs from RLHF
Standard RLHF trains a separate reward model and then optimizes the policy against it with reinforcement learning, sampling fresh outputs during training. DPO collapses this into a single supervised training stage. Given pairs of responses to the same prompt where one is preferred over the other, DPO applies a classification-style (logistic) loss that raises the likelihood of the preferred response relative to the rejected one, with both measured against a frozen reference model. There is no reward model to train and no on-policy sampling loop.
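The loss described above can be sketched for a single preference pair. This is a minimal illustration, not the authors' reference code: the function name `dpo_loss`, its argument names, and the default `beta = 0.1` are assumptions chosen for the example, and each input is taken to be the summed token log-probability of a full response under either the policy being trained or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    All four inputs are log-probabilities of a whole response.
    `beta` (an assumed default) scales how strongly the policy is
    held close to the reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # The loss depends only on how far the chosen reward
    # exceeds the rejected reward.
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2 (about 0.693). It falls as the policy assigns relatively more probability to the preferred response and rises if the policy drifts toward the rejected one. In practice the same computation is applied over batches of tensors with gradients flowing only through the policy terms.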
Why practitioners adopted it
Because it removes the reward model and the unstable RL phase, DPO is more stable, cheaper to run, and far easier to reproduce on modest hardware. The original paper showed it matching or exceeding PPO-based RLHF on sentiment control, summarization, and single-turn dialogue. That accessibility made DPO a default choice for open-weight model post-training and for hobbyists tuning models locally.
DPO is a direct successor to RLHF and is commonly applied to open-weight models as a final alignment step after supervised fine-tuning. Like all preference methods, the values it instills depend on the preference data and on who curated it.
