Definition
RLHF (Reinforcement Learning from Human Feedback) is the alignment technique that turned raw language models into helpful assistants. Introduced at scale by OpenAI's InstructGPT work in 2022, it fine-tunes a pretrained model so its outputs match what humans actually prefer, rather than merely predicting the next plausible token.
The three-stage pipeline
RLHF classically runs in three stages. First, supervised fine-tuning (SFT) trains the base model on human-written instruction examples. Second, human labelers rank several model outputs for a given prompt, and those rankings train a separate reward model that scores responses with a scalar value. Third, the language model is optimized against that reward model using a reinforcement learning algorithm, usually Proximal Policy Optimization (PPO), which constrains each update so training stays stable. A KL-divergence penalty keeps the tuned model from drifting too far from its starting behavior.
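The objectives behind stages two and three can be sketched in a few lines of plain Python. This is a minimal illustration on scalar values, not the InstructGPT implementation: the function names, the default beta and clip_eps values, and the single-sample KL estimate are all assumptions made for the example. Real systems compute these quantities over batches of token log-probabilities with an autodiff framework.

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2 (hypothetical helper): Bradley-Terry style ranking loss.

    Equals -log(sigmoid(r_chosen - r_rejected)); it is small when the reward
    model scores the human-preferred response higher than the rejected one.
    """
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin), identical to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_reference: float, beta: float = 0.1) -> float:
    """Stage 3 (hypothetical helper): reward-model score minus a KL penalty.

    logp_policy - logp_reference is a single-sample estimate of the KL
    divergence between the tuned policy and the frozen starting model.
    beta is an assumed penalty weight, not a value from the source.
    """
    kl_estimate = logp_policy - logp_reference
    return reward - beta * kl_estimate

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """Stage 3 (hypothetical helper): PPO's clipped surrogate for one sample.

    The probability ratio is clipped to [1 - clip_eps, 1 + clip_eps], so a
    single update cannot gain from moving the policy too far.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

In this sketch, a response the policy makes much more likely than the reference model does (a large logp_policy - logp_reference) has its reward reduced, which is the drift control the KL penalty provides.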
Why it matters for sovereignty
RLHF is the reason most assistant models behave consistently, but it also bakes in the preferences and policies of whoever ran the training. For operators who want models that reflect their own values and run on their own hardware, understanding RLHF clarifies what is and isn't fixed about a downloaded model, and where lighter alternatives fit.
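RLHF is computationally heavy because PPO-style training keeps several models in memory at once: the policy being tuned, a frozen reference copy for the KL penalty, the reward model, and usually a value model. That cost motivated simpler successors such as DPO (Direct Preference Optimization), which trains directly on preference pairs without a separate reward model or RL loop, and AI-feedback variants such as Constitutional AI. RLHF is one stage in the broader project of AI alignment, and it typically follows pretraining and supervised fine-tuning.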
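To make the contrast with DPO concrete, here is a minimal sketch of the DPO loss for a single preference pair. The function name, argument names, and the default beta are assumptions for illustration. The inputs are summed log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Computes -log(sigmoid(beta * (chosen_logratio - rejected_logratio))),
    where each log-ratio compares the trained policy to the frozen reference.
    No separate reward model or RL loop is involved.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-logits), identical to -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2. The loss falls as the policy raises the chosen response's probability relative to the rejected one.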
