Definition
A reward model is a neural network that takes a prompt and a candidate response and outputs a single scalar score predicting how much a human would approve of that response. It is the component that turns subjective human preferences into a numeric signal an optimizer can chase, and it sits at the heart of Reinforcement Learning from Human Feedback (RLHF).
How it is trained and used
The reward model is usually trained on pairwise comparisons: annotators see two responses to the same prompt and mark which is better. Under a Bradley-Terry-style objective, the probability that one response is preferred is modeled as the sigmoid of the difference between the two scores, so minimizing the loss pushes the preferred response's score above the rejected one's. Once trained, the reward model stands in for human judgment during the reinforcement phase: a policy-optimization algorithm such as PPO updates the language model to maximize the reward model's scores, teaching it to produce outputs humans would rate highly without a person in the loop at every step.
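The pairwise objective can be sketched in a few lines. This is a minimal illustration rather than any particular library's implementation: it assumes the reward model has already produced one scalar score per response, and the function names bradley_terry_loss and preference_probability are hypothetical.

```python
import math


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the annotator's choice: -log(sigmoid(margin))."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for one annotated pair.
loss_when_correct = bradley_terry_loss(2.0, 0.0)  # model agrees with annotator
loss_when_wrong = bradley_terry_loss(0.0, 2.0)    # model disagrees with annotator
```

The loss is small when the model already ranks the preferred response higher and grows when it ranks the pair the wrong way round. In a real training loop the same quantity would be averaged over a batch and differentiated with respect to the network's weights.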
Strengths and pitfalls
A reward model lets human preference scale to millions of training updates, which is what makes RLHF practical. But it is only a proxy for real human values, and pushing a policy hard against an imperfect proxy is exactly the setup that produces reward hacking and sycophancy. This limitation is part of why newer methods such as Direct Preference Optimization aim to align models without a standalone reward model at all. For self-hosting builders, understanding the reward model clarifies both how today's assistants got their behavior and where that behavior can quietly go wrong.
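A toy example shows how optimizing against a flawed proxy can go wrong. Everything here is an assumption made for illustration: proxy_reward is a deliberately bad hand-written scorer rather than a trained network, the two candidate responses are invented, and best-of-n selection stands in for a policy-optimization algorithm such as PPO.

```python
def proxy_reward(response: str) -> float:
    """Hypothetical flawed scorer: rewards agreeable phrases, ignores accuracy."""
    flattering = ("great question", "you are right", "absolutely")
    text = response.lower()
    return float(sum(text.count(phrase) for phrase in flattering))


def best_of_n(responses, reward_fn):
    """Pick the response with the highest score under the given reward function."""
    return max(responses, key=reward_fn)


# Invented candidates: the first is accurate, the second flatters and is wrong.
candidates = [
    "The claim is incorrect: water boils at 100 C at sea level, not 90 C.",
    "Great question! You are right, absolutely, water boils at 90 C.",
]

picked = best_of_n(candidates, proxy_reward)
```

Under this proxy the flattering but incorrect answer wins, because the scorer never checks accuracy. A learned reward model fails in subtler ways, but the mechanism is the same: the optimizer finds whatever the proxy happens to reward.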
See reward hacking for how reward models get exploited and Direct Preference Optimization for an approach that removes them.
