GRPO (Group Relative Policy Optimization)

Definition

Group Relative Policy Optimization (GRPO) is a reinforcement-learning algorithm introduced by DeepSeek in its DeepSeekMath work and later used to train the DeepSeek-R1 reasoning model. It is a streamlined variant of PPO designed to teach language models complex, multi-step reasoning more efficiently. Its defining trick is eliminating the separate value (critic) network that PPO relies on. Because that critic is typically a model comparable in size to the policy, removing it substantially cuts the memory and compute footprint of training.

Group-relative advantage

For each prompt, GRPO samples a group of candidate responses from the current policy and scores each one with a reward function. Instead of estimating an absolute baseline with a critic, it normalizes the rewards within the group, subtracting the group mean and dividing by the standard deviation. A response that beats its peers gets a positive advantage; one that lags gets a negative one. The policy is then pushed toward the better-than-average answers. The group itself acts as a self-calibrating baseline, which is why no value network is needed.
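The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `group_relative_advantages`, the small `eps` term that guards against division by zero, and the choice of population (rather than sample) standard deviation are all assumptions made for the example.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Return (reward - group mean) / (group std + eps) for each response.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    `eps` is an assumed safeguard for groups where every reward is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer earns the same reward carries no learning signal.
tied = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch the two correct answers receive positive advantages and the two incorrect ones receive negative advantages of equal size. When every response in a group earns the same reward, all advantages come out as zero, so that prompt contributes nothing to the update.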

Why it suits reasoning and local training

Reasoning tasks reward long chains of thought that are hard for a critic to evaluate mid-stream. By grading whole answers relative to siblings, GRPO sidesteps that problem and pairs naturally with verifiable rewards, such as whether a math answer is correct. For practitioners training or fine-tuning models on modest, self-hosted hardware, dropping the critic is a meaningful resource win, lowering the barrier to running alignment experiments outside a large data center.

GRPO still applies a KL penalty that keeps the policy close to a reference model, and it depends on a well-designed reward signal to avoid degenerate solutions such as exploiting loopholes in the reward function.
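As a rough illustration of the KL penalty, the sketch below computes a per-token estimate of the form ratio - log(ratio) - 1, where ratio is the reference model's probability of a token divided by the policy's. This form is always non-negative and is zero when the two models agree. Treat it as one plausible formulation rather than the definitive one: the function name `kl_penalty_per_token` and its log-probability inputs are hypothetical, chosen for the example.

```python
import math

def kl_penalty_per_token(logp_policy, logp_ref):
    """Estimate the KL penalty for one token from two log-probabilities.

    logp_policy: log-probability the current policy assigns to the token.
    logp_ref:    log-probability the frozen reference model assigns to it.
    Both inputs are hypothetical names used for this sketch.
    """
    log_ratio = logp_ref - logp_policy          # log(pi_ref / pi_theta)
    return math.exp(log_ratio) - log_ratio - 1.0

# Identical log-probabilities give no penalty.
same = kl_penalty_per_token(-1.0, -1.0)

# A policy that has drifted from the reference is penalized.
drifted = kl_penalty_per_token(-1.0, -2.0)
```

In training, a term like this would be scaled by a coefficient and subtracted from the objective, so the policy pays a growing cost the further it moves from the reference model.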
