Definition
Policy gradient is a family of reinforcement-learning methods that optimize a model's behavior, its policy, directly by following the gradient of expected reward. Rather than learning the value of every possible action first and then acting greedily, policy-gradient methods adjust the probabilities of actions to make high-reward outcomes more likely. This direct approach is the mathematical foundation beneath nearly every modern algorithm used to align language models.
The policy gradient theorem and REINFORCE
The policy gradient theorem expresses the gradient of expected return as the expectation of the gradient of the log-probability of an action multiplied by the return that followed it. In plain terms: if a choice led to good outcomes, increase its probability; if bad, decrease it. The earliest concrete algorithm, REINFORCE, applies this using complete-episode Monte Carlo returns. It is elegant but suffers from high variance, because a single noisy episode can swing the update wildly. The standard remedy is subtracting a baseline, the expected average performance, so the model learns from how much better or worse an action did than typical.
Why it underlies LLM alignment
For language models, each generated token is an action and the whole response is a trajectory. Treating text generation as a policy-gradient problem is what makes RLHF possible. The variance problem REINFORCE exposed is exactly why later methods added advantage estimation and clipping. Understanding policy gradients demystifies why algorithms like PPO and GRPO look the way they do.
Every rollout a model generates becomes a training signal in this framework.
In Simple Terms
Policy gradient is a family of reinforcement-learning methods that optimize a model’s behavior, its policy, directly by following the gradient of expected reward. Rather than…
