Definition
A preference dataset is the fuel for modern LLM alignment: a collection of prompts, each paired with responses labeled as "chosen" (preferred) or "rejected" (dispreferred). Methods like RLHF and Direct Preference Optimization (DPO) learn from these comparisons to make a model produce outputs people actually want.
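To make "learning from comparisons" concrete, here is a minimal sketch of the per-pair loss used in Direct Preference Optimization. It assumes you already have the summed token log-probabilities of each response under the model being trained and under a frozen reference model; the function name, argument names, and the default `beta` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single chosen/rejected pair.

    Inputs are summed token log-probabilities of each response under the
    policy (model being trained) and a frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss is log(2) when the policy and reference agree, and it falls as the policy assigns relatively more probability to the chosen response than to the rejected one.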
How they are built
The classic format pairs two responses to the same prompt and records which one a human (or an AI judge) preferred. Anthropic's HH-RLHF dataset, released in 2022, contains roughly 169,000 chosen-rejected pairs covering helpfulness and harmlessness, with each line of its data files holding one "chosen" and one "rejected" text. UltraFeedback, another widely used set, collects several model responses per prompt across diverse prompts and has an AI judge score them; pairs can then be derived from the scores. Some newer methods, such as KTO (Kahneman-Tversky Optimization), drop pairing altogether and use a single binary good/bad label per response.
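The two layouts described above can be sketched in a few lines of Python. This is an illustrative example only: the field names `prompt`, `chosen`, `rejected`, `completion`, and `label` are assumptions made for the sketch, and real datasets differ in how they name fields and whether the prompt is stored separately.

```python
import json
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def parse_pairs(jsonl_text):
    """Parse JSON Lines text where each line holds one preference pair."""
    pairs = []
    for raw in jsonl_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        rec = json.loads(raw)
        pairs.append(PreferencePair(rec["prompt"], rec["chosen"], rec["rejected"]))
    return pairs


def to_binary_examples(pairs):
    """Flatten pairs into unpaired good/bad examples (KTO-style labels)."""
    examples = []
    for p in pairs:
        examples.append({"prompt": p.prompt, "completion": p.chosen, "label": True})
        examples.append({"prompt": p.prompt, "completion": p.rejected, "label": False})
    return examples
```

Each pair yields two labeled examples, so paired data can always be flattened into the binary form. Going the other way needs extra work, because unpaired labels do not say which good response should be compared against which bad one.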
Why the dataset is the bottleneck
The quality, diversity, and coverage of a preference dataset largely determine how well alignment works, because a model can only learn the preferences its data encodes. Datasets whose responses were written by a different model can also cause distribution shift: the examples being compared no longer resemble what the model in training would actually produce. On-policy approaches like Online DPO aim to reduce this by regenerating the data from the current model.
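As a rough sketch of the on-policy idea, the function below samples several responses from the current model for each prompt, scores them, and keeps the best and worst as a fresh chosen/rejected pair. The callables `generate` and `judge` are hypothetical placeholders for your own sampling and scoring code, and this is not the exact procedure of Online DPO or of any specific library.

```python
def build_on_policy_pairs(prompts, generate, judge, num_samples=4):
    """Build chosen/rejected pairs from the current model's own outputs.

    generate(prompt) -> str      : samples one response (hypothetical)
    judge(prompt, response) -> float : higher means better (hypothetical)
    """
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(num_samples)]
        # Score each candidate once, then order from worst to best.
        scored = sorted(
            ((judge(prompt, c), c) for c in candidates),
            key=lambda item: item[0],
        )
        (worst_score, worst), (best_score, best) = scored[0], scored[-1]
        # A tie carries no preference signal, so skip it.
        if best_score > worst_score:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```

Because the pairs come from the model being trained, rerunning this step as the model changes keeps the comparisons close to what it currently produces.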
For sovereign AI builders, owning your preference dataset means owning your model's values. Curating it on your own infrastructure — rather than relying on a vendor's hidden labels — keeps the alignment process transparent and under your control.
