Policy Gradient

Policy gradient is a family of reinforcement-learning methods that optimize a model's behavior — its policy — directly by following the gradient of expected reward. Rather than learning the value of every possible action first and then acting greedily, policy-gradient methods adjust the probabilities of actions to make high-reward outcomes more likely. This direct approach is the mathematical foundation beneath nearly every modern algorithm used to align language models, which is why the term keeps surfacing wherever fine-tuning with feedback is discussed.

The policy gradient theorem and REINFORCE

The policy gradient theorem expresses the gradient of expected return as an expectation, over trajectories sampled from the current policy, of the gradient of each action's log-probability multiplied by the return that followed it. In plain terms: if a choice led to good outcomes, increase its probability; if it led to bad ones, decrease it. The earliest concrete algorithm, REINFORCE, applies this using complete-episode Monte Carlo returns. It is elegant but suffers from high variance, because a single noisy episode can swing the update wildly: one lucky rollout can reinforce a mediocre behavior, and one unlucky one can punish a good habit.
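As a minimal sketch of the REINFORCE estimator, the example below uses a toy three-action problem with a softmax policy. The setup is an assumption made for illustration and does not come from the article: each episode is a single action, returns carry unit Gaussian noise, and the names `reinforce_gradient`, `exact_gradient`, and `mean_rewards` are hypothetical. It compares a one-episode estimate with an average over many episodes and with the gradient computed analytically.

```python
import numpy as np

def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def reinforce_gradient(logits, mean_rewards, n_episodes, rng):
    """Monte Carlo estimate of the gradient of expected return w.r.t. logits.

    Averages grad(log pi(a)) * return over sampled one-step episodes.
    """
    probs = softmax(logits)
    n_actions = len(probs)
    actions = rng.choice(n_actions, size=n_episodes, p=probs)
    # Assumed toy environment: noisy return around a fixed mean per action.
    returns = mean_rewards[actions] + rng.normal(size=n_episodes)
    # For a softmax policy, grad of log pi(a) w.r.t. logits is one_hot(a) - probs.
    grad_log_prob = np.eye(n_actions)[actions] - probs
    return (grad_log_prob * returns[:, None]).mean(axis=0)

def exact_gradient(logits, mean_rewards):
    """Analytic gradient for the same toy problem, used as a reference."""
    probs = softmax(logits)
    return probs * (mean_rewards - probs @ mean_rewards)

rng = np.random.default_rng(0)
logits = np.zeros(3)
mean_rewards = np.array([1.0, 2.0, 3.0])

noisy = reinforce_gradient(logits, mean_rewards, n_episodes=1, rng=rng)
averaged = reinforce_gradient(logits, mean_rewards, n_episodes=20_000, rng=rng)
print("one episode:   ", noisy)
print("20000 episodes:", averaged)
print("exact:         ", exact_gradient(logits, mean_rewards))
```

The single-episode estimate can land far from the exact gradient, while the large average settles close to it. This is the variance problem in miniature.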

Taming the variance

The standard first remedy is subtracting a baseline, an estimate of typical return that does not depend on the action taken, so the model learns from how much better or worse an action did than usual, without changing the gradient's expected direction. Extending the baseline into a learned value function yields the actor-critic family, where a critic estimates expected return and the actor is updated using the difference between the observed return and that estimate. That difference is the advantage, and computing it well is its own sub-field: see advantage estimation. Layered on top come trust-region ideas, which keep each update small enough that the data collected under the old policy remains a valid guide; this is the problem PPO addresses with its clipped objective.
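The effect of a baseline can be checked numerically. The sketch below is an illustration under assumed conditions rather than a recipe from the article: a fixed three-action policy, one-step episodes, unit-variance reward noise, and a baseline set to the policy's true expected return (in practice this would be estimated). The function name `per_sample_gradients` is hypothetical.

```python
import numpy as np

def per_sample_gradients(probs, mean_rewards, baseline, n_samples, rng):
    """Per-episode policy-gradient terms: grad(log pi(a)) * (return - baseline)."""
    n_actions = len(probs)
    actions = rng.choice(n_actions, size=n_samples, p=probs)
    returns = mean_rewards[actions] + rng.normal(size=n_samples)
    # Softmax policy: grad of log pi(a) w.r.t. logits is one_hot(a) - probs.
    grad_log_prob = np.eye(n_actions)[actions] - probs
    return grad_log_prob * (returns - baseline)[:, None]

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.3, 0.5])
mean_rewards = np.array([10.0, 11.0, 12.0])  # large offset inflates variance
baseline = probs @ mean_rewards              # assumed known here for simplicity

plain = per_sample_gradients(probs, mean_rewards, 0.0, 50_000, rng)
centered = per_sample_gradients(probs, mean_rewards, baseline, 50_000, rng)

print("mean without baseline:", plain.mean(axis=0))
print("mean with baseline:   ", centered.mean(axis=0))
print("total variance without baseline:", plain.var(axis=0).sum())
print("total variance with baseline:   ", centered.var(axis=0).sum())
```

Both versions average to roughly the same gradient, but the centered one has far lower per-sample variance, because the constant offset shared by all rewards no longer multiplies every update.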

Why it underlies LLM alignment

For language models, each generated token is an action, the whole response is a trajectory, and a reward model or verifiable check supplies the return. Treating text generation as a policy-gradient problem is what makes RLHF possible: the model samples responses, each rollout is scored, and the log-probabilities of the tokens that produced high scores are nudged upward. The variance problem REINFORCE exposed is exactly why practical LLM pipelines add advantage estimation, clipping, and a KL-divergence penalty against a reference model. Newer variants such as GRPO simplify the machinery by using the group of sampled responses to a single prompt as its own baseline, dropping the separate critic entirely.
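The group-as-baseline idea can be sketched in a few lines. This is a simplified illustration, not the full GRPO objective: it omits clipping and the KL penalty, sums log-probabilities over tokens without length normalization, and divides by the group's standard deviation as an assumed normalization choice. The function names and the example numbers are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    # The group mean plays the role of the baseline; no learned critic is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def policy_gradient_loss(token_log_probs, advantages):
    """Negative policy-gradient surrogate over a group of responses.

    token_log_probs: one array per response, holding the log-probability
    the policy assigned to each generated token.
    """
    per_response = np.array(
        [adv * np.sum(lp) for lp, adv in zip(token_log_probs, advantages)]
    )
    return -per_response.mean()

# Four sampled responses to one prompt, scored by a hypothetical checker.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Made-up token log-probabilities for the four responses.
token_log_probs = [
    np.array([-0.2, -0.1, -0.4]),
    np.array([-1.0, -0.3]),
    np.array([-0.5, -0.5, -0.5, -0.5]),
    np.array([-0.1]),
]
loss = policy_gradient_loss(token_log_probs, advantages)
print(advantages, loss)
```

In a real pipeline the loss would be built inside an automatic-differentiation framework so that minimizing it raises the log-probabilities of tokens in above-average responses and lowers those in below-average ones.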

The intuition worth keeping

The family's main practical constraint is that it is on-policy: the gradient estimate is strictly valid only for data sampled from the current policy, so experience starts going stale as soon as the model updates. This is why LLM alignment alternates generation and optimization in tight cycles, and why sample efficiency, meaning squeezing more learning from each batch of rollouts before they expire, drives so much of the algorithmic engineering in this space. Generation, not gradient math, is usually the wall-clock bottleneck of an alignment run.
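One common way to take a few extra optimization steps on the same batch is to reweight each sample by the ratio of its probability under the updated policy to its probability under the policy that generated it, and to clip that ratio so stale data cannot drive a large move. The sketch below shows the clipped surrogate used by PPO in that spirit. The function name and the `clip_eps=0.2` default are illustrative assumptions, and the value returned is the quantity to maximize, not a complete training loop.

```python
import numpy as np

def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized).

    new_log_probs: log-probabilities of the sampled actions under the
                   policy being optimized.
    old_log_probs: log-probabilities under the policy that generated the data.
    """
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes any incentive to push the ratio
    # beyond the clip range in the direction the advantage favors.
    return np.minimum(unclipped, clipped).mean()

old_log_probs = np.zeros(3)
new_log_probs = np.log(np.array([1.5, 0.5, 1.0]))  # ratios 1.5, 0.5, 1.0
advantages = np.array([1.0, -1.0, 2.0])
value = clipped_surrogate(new_log_probs, old_log_probs, advantages)
print(value)
```

When the ratio stays near 1 the objective matches the ordinary policy-gradient surrogate. Once the policy has drifted past the clip range, the sample stops contributing further gradient in that direction, which is a practical signal that the batch is spent.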

Strip away the notation and the method is craftsmanlike in spirit: try things, measure honestly, do more of what worked, less of what didn't, and never lurch so far on one measurement that you can no longer trust your footing. Every acronym in the alignment literature is an engineering answer to one of two questions — how to measure "what worked" with less noise, and how to move without overstepping. Understanding policy gradients demystifies why algorithms like PPO and GRPO look the way they do, and equips anyone running their own fine-tuning experiments to read a training curve and know which of the two problems they are staring at.
