Definition
Reward shaping is a reinforcement-learning technique that supplements an environment's sparse or delayed reward with extra intermediate signals, helping a model learn faster by rewarding progress along the way rather than only at the finish line. In language-model training, naive shaping is a common cause of reward hacking, so it must be applied carefully.
Potential-based shaping and policy invariance
The foundational result comes from Ng, Harada, and Russell (1999), who showed that if the extra reward is the discounted difference of a potential function over states, F(s, s') = γΦ(s') − Φ(s), where γ is the discount factor and Φ assigns a real-valued potential to each state, then the set of optimal policies is unchanged. This property, called policy invariance, means the shaping can speed up learning without redefining the goal. Shaping rewards that do not take this form carry no such guarantee: they can distract the learner, teaching it to chase the bonus rather than the real objective.
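A minimal sketch of why the potential-based form is safe, assuming a toy one-dimensional chain of integer states with a goal at state 5. The potential `phi`, the constant `GOAL`, and the helper names are all hypothetical and chosen only for illustration. The point it shows is that the discounted shaping bonuses telescope, so their total over a trajectory depends on the endpoints rather than on the route taken.

```python
from typing import Callable, Sequence

GOAL = 5  # hypothetical goal state on a 1-D chain (illustrative assumption)


def phi(state: int) -> float:
    """Hypothetical potential: negative distance to the goal, so phi(GOAL) == 0."""
    return -float(abs(GOAL - state))


def shaping_bonus(potential: Callable[[int], float], s: int, s_next: int, gamma: float) -> float:
    """Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(s_next) - potential(s)


def discounted_shaping_total(states: Sequence[int], potential: Callable[[int], float], gamma: float) -> float:
    """Sum of gamma**t * F(s_t, s_{t+1}) along a trajectory of visited states."""
    total = 0.0
    for t, (s, s_next) in enumerate(zip(states, states[1:])):
        total += (gamma ** t) * shaping_bonus(potential, s, s_next, gamma)
    return total


gamma = 0.9
direct = [0, 1, 2, 3, 4, 5]          # straight to the goal
detour = [0, 1, 2, 1, 2, 3, 4, 5]    # wanders back one step first

# The sum telescopes to gamma**T * Phi(s_T) - Phi(s_0) for a T-step trajectory.
print(discounted_shaping_total(direct, phi, gamma))
print(discounted_shaping_total(detour, phi, gamma))
```

Because `phi(GOAL)` is zero in this sketch, both trajectories collect the same shaping total, equal to −Φ(s_0), up to floating-point error. The bonus therefore cannot make the detour look better than the direct route, which is the intuition behind policy invariance.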
Relevance to aligning language models
When training a model with reinforcement learning, the reward model is itself an imperfect proxy for human intent. Hand-crafted shaping terms, such as penalizing excessive length, rewarding correct formatting, or encouraging verifiable answers, are routinely added to the main reward. Poorly designed shaping is one of the surest ways to induce degenerate behavior, where the model exploits the shaped signal while ignoring what you actually wanted. For anyone fine-tuning a self-hosted model, disciplined reward design is the difference between a model that reasons well and one that games its grader.
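The hand-crafted terms described above might be combined as in the following sketch. Everything here is an assumption made for illustration: the `<answer>` tag convention, the word-count length measure, the coefficient values, and the function name `shaped_reward` do not come from any particular training library. The one design choice worth noting is the cap on the total shaping contribution, which is one possible way to keep the auxiliary terms from outweighing the main reward.

```python
import re

# Hypothetical output convention: the final answer is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>.*?</answer>", re.DOTALL)


def shaped_reward(
    base_reward: float,
    response: str,
    length_budget: int = 200,    # words allowed before the penalty starts (assumed)
    length_coef: float = 0.001,  # penalty per word over budget (assumed)
    format_bonus: float = 0.1,   # bonus for following the format (assumed)
    shaping_cap: float = 0.2,    # bound on the total shaping magnitude (assumed)
) -> float:
    """Add a length penalty and a format bonus to the main reward.

    The combined shaping term is clipped to [-shaping_cap, shaping_cap] so it
    stays small relative to base_reward.
    """
    n_words = len(response.split())
    length_penalty = -length_coef * max(0, n_words - length_budget)
    fmt = format_bonus if ANSWER_PATTERN.search(response) else 0.0
    shaping = max(-shaping_cap, min(shaping_cap, length_penalty + fmt))
    return base_reward + shaping


print(shaped_reward(1.0, "<answer>42</answer>"))  # short, well-formatted response
print(shaped_reward(0.0, "word " * 1000))         # long response with no answer tags
```

Note that terms like these are not potential-based, so they carry no invariance guarantee. A model could, for example, emit empty `<answer>` tags to collect the format bonus, which is exactly the kind of exploitation the paragraph above warns about.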
In policy-gradient training, the shaped reward is what the learner accumulates into returns, so shaping terms flow directly into the advantage estimate that drives each policy update.
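As a rough sketch of that data flow, the snippet below adds per-step shaping terms to the environment reward, computes discounted returns, and subtracts a value baseline to get advantages. The function names and the simple return-minus-baseline estimator are assumptions for illustration. Practical systems often use other estimators, such as generalized advantage estimation.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Reward-to-go G_t = r_t + gamma * G_{t+1}, computed backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(
    env_rewards: Sequence[float],
    shaping_terms: Sequence[float],
    values: Sequence[float],
    gamma: float,
) -> List[float]:
    """Advantage estimate A_t = G_t - V(s_t), where G_t uses the shaped reward."""
    shaped = [r + f for r, f in zip(env_rewards, shaping_terms)]
    returns = discounted_returns(shaped, gamma)
    return [g - v for g, v in zip(returns, values)]


# Sparse environment reward at the last step, plus one early shaping bonus.
env_rewards = [0.0, 0.0, 1.0]
shaping_terms = [0.5, 0.0, 0.0]
baseline_values = [0.0, 0.0, 0.0]  # placeholder baseline (assumed)
print(advantages(env_rewards, shaping_terms, baseline_values, gamma=1.0))
```

With the zero baseline and no discounting used here, the early bonus raises only the first step's advantage. Any shaping term therefore changes which actions the update reinforces, not just how fast learning proceeds.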
