Grokking

Sovereign AI

Grokking is a training phenomenon in which a neural network first memorizes its training data — reaching near-perfect training accuracy while performing poorly on held-out examples — and then, after a long period of apparently stalled progress, abruptly begins to generalize, with test accuracy climbing sharply. First reported by Power and colleagues in 2022 on small models trained on modular-arithmetic tasks, grokking is striking because genuine generalization arrives long after overfitting, sometimes after orders of magnitude more training steps than it took to fit the data. The name borrows Robert Heinlein's word for understanding something so completely that it becomes part of you, which captures the sense that the model finally grasps the rule rather than the examples.

What it reveals about learning

The naive intuition is that a model which has fit its training set perfectly has finished learning. Grokking contradicts this. Beneath a training loss that looks frozen near zero, the network keeps quietly reorganizing its internal representations, eventually discovering the underlying rule rather than a lookup table of memorized answers. Researchers link the transition to regularization pressure — weight decay in particular — slowly pushing the model toward a simpler, lower-norm solution once memorization no longer reduces the loss. In that reading, grokking is the moment a compact, general circuit finally out-competes a bulky, memorized one that happened to fit the data first. Interpretability studies of the arithmetic case have even reverse-engineered the tidy algorithm the network converges on, showing that the general solution really is structurally different from the memorized one, not just better-regularized.
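The pull toward a lower-norm solution can be seen in a deliberately tiny toy problem. The sketch below is not a grokking experiment: it assumes a redundant two-weight linear model fit to a single example, with a hypothetical learning rate and weight-decay strength chosen only for illustration. It starts from a large-norm set of weights that already fits the example exactly, then shows weight decay shrinking the norm while the data loss stays close to zero.

```python
import math

def step(w, lr, weight_decay):
    """One gradient-descent step on 0.5*(w1 + w2 - 1)^2 + 0.5*weight_decay*|w|^2."""
    residual = w[0] + w[1] - 1.0  # prediction error on the single example (x = 1, y = 1)
    return [wi - lr * (residual + weight_decay * wi) for wi in w]

def data_loss(w):
    """Squared error on the single training example, without the penalty term."""
    return 0.5 * (w[0] + w[1] - 1.0) ** 2

def norm(w):
    return math.hypot(w[0], w[1])

# A large-norm solution that already fits the example exactly: 5 + (-4) = 1.
w = [5.0, -4.0]
norms, losses = [norm(w)], [data_loss(w)]

for _ in range(10_000):
    w = step(w, lr=0.1, weight_decay=0.01)  # illustrative hyperparameters
    norms.append(norm(w))
    losses.append(data_loss(w))

print(f"norm: {norms[0]:.3f} -> {norms[-1]:.3f}, max data loss: {max(losses):.2e}")
```

In this toy the weight norm falls from roughly 6.4 to roughly 0.7 while the data loss never leaves the neighbourhood of zero, which is the "frozen loss, moving weights" picture in miniature. Real networks are nonlinear, so this shows only the direction of the pressure, not the sudden jump in test accuracy.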

Why it matters for self-hosters

Grokking is a clean laboratory example of the gap between memorization and true generalization, the same gap that separates a model that has genuinely learned a concept from one that has merely pattern-matched its corpus. It carries a practical warning: early stopping on a validation plateau can end training just before the model would have generalized, so a flat curve is not always a signal to stop. For a sovereign operator fine-tuning an open-weight model on their own hardware, the lesson is that a checkpoint's benchmark score is a snapshot, not a verdict, and that longer runs with proper regularization sometimes pay off suddenly rather than smoothly. It also argues for holding onto intermediate checkpoints instead of assuming the latest is always best, and for treating a stubborn plateau as a question rather than a conclusion.
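The early-stopping warning can be made concrete with a short sketch. The validation curve below is synthetic and hand-written to have a grokking-like shape; it is not measured from any real run, and the function name and patience value are hypothetical. The code applies a standard patience rule and compares where it would halt with where the best checkpoint actually sits.

```python
def early_stop_index(val_scores, patience, min_delta=0.0):
    """Index of the evaluation at which patience-based early stopping halts.

    Returns None if the rule never triggers.
    """
    best = float("-inf")
    since_best = 0
    for i, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return None

# Synthetic curve: a long flat plateau followed by a late, sharp rise.
val_scores = [0.10] * 40 + [0.35, 0.80, 0.99, 0.99]

stop = early_stop_index(val_scores, patience=5)
best_idx = max(range(len(val_scores)), key=val_scores.__getitem__)

print(f"early stopping halts at evaluation {stop}; best checkpoint is evaluation {best_idx}")
```

On this made-up curve the patience rule gives up long before the rise begins. That does not make early stopping wrong in general; it suggests choosing patience with the possibility of a late transition in mind and keeping checkpoints so the best one can be picked after the fact.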

Where the research is headed

Grokking has become a favourite testbed precisely because it is small enough to study end to end. Work since 2022 has probed which ingredients trigger it (dataset size, weight decay strength, and the fraction of the available data used for training all matter) and has connected the sudden jump to the geometry of the loss landscape. The open question is how much of this scales: if a two-layer transformer can hide a fully formed algorithm behind a plateau, larger systems may be doing something similar in ways current metrics cannot see. That uncertainty is exactly why loss curves alone are an incomplete picture of what a model knows.
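For readers who want to see how small the testbed is, here is a minimal sketch of a modular-addition dataset with an adjustable training fraction. The function name, the modulus of 97, the 30% split, and the seed are illustrative assumptions, not a prescription taken from any particular paper.

```python
import random

def modular_addition_split(p=97, train_fraction=0.3, seed=0):
    """Build every (a, b, (a + b) mod p) triple and split it into train and held-out sets."""
    triples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(triples)
    n_train = int(train_fraction * len(triples))
    return triples[:n_train], triples[n_train:]

train, val = modular_addition_split()
print(len(train), "training examples,", len(val), "held-out examples")
```

The whole task is a table of p squared entries, so the training fraction directly controls how much of the rule the model sees and how much it must infer. Sweeping `train_fraction` and the optimizer's weight decay is one way to explore the ingredients named above.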

Where it sits in the wider picture

A useful mental picture is that training explores a landscape of possible solutions, and memorization is simply the first low valley the optimizer falls into. The general solution sits in a different, harder-to-reach valley that only becomes attractive once regularization keeps penalizing the bulky memorized weights. Seen this way, the long plateau is not wasted computation but a slow drift across that landscape toward better structure. The practical habit that follows is to log the gap between training and validation performance over long horizons and resist declaring a run finished the moment training loss flattens.
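The logging habit can be sketched in a few lines. The log entries below are invented for illustration and are not results from a real run; the field names `step`, `train_acc`, and `val_acc` and the 0.95 threshold are assumptions. The helpers report the train/validation gap at each logged step and the steps at which each metric first crosses the threshold.

```python
def generalization_gap(log):
    """Return (step, train_acc - val_acc) for each logged evaluation."""
    return [(e["step"], e["train_acc"] - e["val_acc"]) for e in log]

def first_step_reaching(log, key, threshold):
    """First logged step at which log[key] meets the threshold, or None."""
    for entry in log:
        if entry[key] >= threshold:
            return entry["step"]
    return None

# Invented log with a grokking-like shape, spaced on a log scale of steps.
log = [
    {"step": 100,     "train_acc": 0.60, "val_acc": 0.05},
    {"step": 1_000,   "train_acc": 1.00, "val_acc": 0.08},
    {"step": 10_000,  "train_acc": 1.00, "val_acc": 0.10},
    {"step": 100_000, "train_acc": 1.00, "val_acc": 0.98},
]

gaps = generalization_gap(log)
fit_step = first_step_reaching(log, "train_acc", 0.95)
generalize_step = first_step_reaching(log, "val_acc", 0.95)

for step, gap in gaps:
    print(f"step {step:>7}: gap {gap:.2f}")
print(f"fit at step {fit_step}, generalized at step {generalize_step}")
```

Recording both crossing points makes the delay between fitting and generalizing an explicit number for each run, rather than something inferred by eye from a plot.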

Grokking sits alongside emergent abilities as a case where capability arrives abruptly rather than gradually, and both complicate naive readings of scaling laws. It also sharpens the debate over whether next-token prediction builds a genuine world model or an elaborate, well-disguised memory. Taken together, these results caution against reading too much certainty into a single training curve.
