Reward Model

A reward model is a neural network that takes a prompt and a candidate response and outputs a single scalar score predicting how much a human would approve of that response. It is the component that turns subjective human preferences into a numeric signal an optimizer can chase, and it sits at the heart of Reinforcement Learning from Human Feedback (RLHF), the training stage that gave modern chat assistants their helpful, conversational behaviour. If you have ever wondered why a local model apologizes, hedges, or refuses in a particular style, the answer usually traces back to what some reward model, somewhere in its lineage, learned to score highly.

How a reward model is trained

The standard recipe starts with pairwise comparisons. Human annotators are shown two responses to the same prompt and asked which one is better. Thousands or millions of these judgments are collected, and the reward model is trained with a Bradley-Terry style objective: it learns to assign a higher score to the preferred response in each pair. Crucially, the annotators never assign absolute numbers — the scalar scale emerges from the comparisons themselves. The reward model is typically initialized from the same pretrained language model it will later judge, with its output head swapped for a single scalar, so it already understands language before it learns to grade it.
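
The pairwise objective described above can be sketched in a few lines. This is a minimal illustration, not any particular lab's training code: the function names are hypothetical, and the scores are plain floats standing in for the scalar a real reward model would output for each response.

```python
import math


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the annotator's choice under Bradley-Terry.

    Equals -log(sigmoid(score_chosen - score_rejected)), written in a
    numerically stable form so large margins do not overflow.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores for three annotated pairs: (chosen, rejected).
pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
mean_loss = sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)
```

Only the difference between the two scores enters the loss, which is why the absolute scale is never specified by annotators: adding the same constant to every score leaves the objective unchanged.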

How it is used during RLHF

Once trained, the reward model stands in for human judgment at machine speed. A policy-optimization algorithm, classically PPO (Proximal Policy Optimization), generates responses from the language model, scores each one with the reward model, and nudges the model's weights toward outputs that score higher. A penalty term, commonly a KL-divergence penalty measured against the starting model, keeps the policy from drifting too far from its supervised starting point. The result is a model shaped by human preference without a person in the loop for every gradient step, which is how preference data collected once can drive millions of training updates. This scaling trick is what made RLHF practical, and it is a large part of why assistants trained this way feel qualitatively different from raw pretrained models.
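
As a rough sketch of how the score and the drift penalty combine, the snippet below computes the quantity the optimizer would try to maximize for one sampled response. It assumes the penalty is a KL-style term estimated from per-token log-probabilities, which is a common choice but not the only one. The function name, the `beta` coefficient, and the example numbers are all hypothetical, and the PPO update itself is omitted.

```python
def shaped_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model.

    policy_logprobs / reference_logprobs: log-probability each model assigns
    to every token of the sampled response (same length, same order).
    beta: assumed penalty strength; larger values keep the policy closer
    to its starting point.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must cover the same tokens")
    # Single-sample estimate of the KL divergence for this response:
    # the sum over tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate


# A response the policy now favours more than the reference model does:
# the reward-model score is discounted by the estimated drift.
reward = shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1)
```

When the policy and the reference assign identical log-probabilities, the penalty vanishes and the optimizer sees the raw reward-model score.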

Where reward models go wrong

A reward model is only a proxy for real human values, and optimizing hard against an imperfect proxy is precisely the setup that produces reward hacking: the policy discovers outputs that score well without actually being good. Sycophancy — telling the user what they want to hear — is the most familiar symptom, because agreement tends to be rated kindly by annotators. Reward models also inherit the biases of their annotator pool and can be gamed by superficial features like length or confident tone. These limitations motivated Direct Preference Optimization and related methods, which fold the preference signal directly into the language model's loss and dispense with a standalone reward model entirely.
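
To make the contrast concrete, here is a minimal sketch of the Direct Preference Optimization loss for a single preference pair. It has the same pairwise shape as the reward-model objective, but the score is replaced by how much the policy has shifted its log-probability relative to a frozen reference model. The function name, argument names, and `beta` value are hypothetical, and each argument is assumed to be the summed log-probability of a whole response.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_shift - rejected_shift)))."""
    # How far the policy has moved each response away from the reference.
    chosen_shift = policy_chosen_lp - ref_chosen_lp
    rejected_shift = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_shift - rejected_shift)
    # Numerically stable -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Before any training the policy equals the reference, so the margin is zero.
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

No separate scoring network appears anywhere in this computation: the preference data updates the language model directly.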

Why self-hosters should care

For a sovereign Bitcoiner running open-weight models locally, the reward model is invisible but consequential. The alignment behaviour baked into a downloaded checkpoint — its refusal patterns, its tone, its eagerness to please — was shaped by someone else's reward model and someone else's annotators. Running inference on your own hardware gives you control over the weights you choose and the ability to apply your own fine-tuning on top, but it does not undo preference training already in the model. Understanding reward models tells you what you are actually inheriting when you pull a model file, and why two models with identical architectures can behave so differently: the difference is not the transformer, it is the reward signal each was optimized against.
