Scaling Laws

Scaling laws are empirical relationships showing that a language model's prediction error (loss) decreases smoothly and predictably as you increase three quantities: the number of model parameters, the size of the training dataset, and the amount of compute spent on training. Documented by Kaplan and colleagues at OpenAI in 2020, these relationships hold as power laws across many orders of magnitude: loss falls roughly as a fixed fractional power of each input, so every doubling of a quantity multiplies the loss by about the same constant factor, provided the other inputs are not the bottleneck. Plotted on log-log axes, the curves are strikingly straight lines, which is what made them useful as engineering instruments rather than curiosities.
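To make the "fixed fractional power" concrete, here is a minimal sketch of a pure power law, L(x) = coeff * x^(-alpha). The function names and the values of `coeff` and `alpha` are illustrative assumptions, not figures from any published fit.

```python
def power_law_loss(x: float, coeff: float, alpha: float) -> float:
    """Loss predicted by a pure power law: L(x) = coeff * x ** -alpha."""
    return coeff * x ** (-alpha)


def doubling_factor(alpha: float) -> float:
    """Factor the loss is multiplied by each time the input doubles."""
    return 2.0 ** (-alpha)


# Illustrative values only (assumptions, not measured constants).
ALPHA = 0.08
COEFF = 10.0

for n_params in (1e8, 2e8, 4e8, 8e8):
    loss = power_law_loss(n_params, COEFF, ALPHA)
    print(f"{n_params:.0e} parameters -> loss {loss:.4f}")

# Every doubling shrinks the loss by the same ratio, whatever the starting size.
print(f"loss ratio per doubling: {doubling_factor(ALPHA):.4f}")
```

Because the ratio per doubling is constant, the points fall on a straight line when both axes are logarithmic.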

Why predictability changed the field

Before scaling laws, the payoff of building a bigger model was uncertain — a lab could spend enormous sums and get a marginal improvement, or a leap, with no way to know in advance. The discovery that loss follows a clean curve meant labs could extrapolate: measure performance at small scale, fit the curve, and predict how a model a hundred times larger would perform before spending the money to train it. This turned model development from research guesswork into capital planning, and it directly motivated the race to ever-larger models. It also shifted the industry's center of gravity from architectural cleverness toward scale itself, since the curves suggested that more parameters, data, and compute reliably beat most clever tricks at fixed scale.
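The measure, fit, extrapolate loop described above can be sketched in a few lines. This is a simplified illustration: it assumes a pure power law with no irreducible-loss floor, uses synthetic "measurements" generated from a known curve, and the helper names `fit_power_law` and `predict_loss` are hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares straight line through (log size, log loss).

    Returns (coeff, alpha) such that loss is approximately coeff * size ** -alpha.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(size, coeff, alpha):
    """Evaluate the fitted power law at a new size."""
    return coeff * size ** (-alpha)


# Synthetic small-scale runs (assumption: generated from a known curve,
# standing in for real measured losses).
small_sizes = [1e6, 1e7, 1e8]
small_losses = [5.0 * s ** -0.1 for s in small_sizes]

coeff, alpha = fit_power_law(small_sizes, small_losses)
target = 100 * small_sizes[-1]  # a model a hundred times larger
print(f"fitted alpha = {alpha:.3f}")
print(f"predicted loss at {target:.0e} parameters: {predict_loss(target, coeff, alpha):.4f}")
```

With real data the points scatter around the line, so the extrapolation carries an error bar that widens the further it reaches past the measured sizes.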

The Chinchilla correction

Kaplan's original recipe emphasized growing model size faster than dataset size for a fixed compute budget. A 2022 follow-up from DeepMind, the Chinchilla study, re-fit the curves with better methodology and reached a different allocation: parameters and training tokens should scale roughly in step, and the models of that era were substantially undertrained for their size. A smaller model trained on much more data matched or beat larger, data-starved peers at the same compute cost. This Chinchilla-optimal recipe reshaped training practice, and the industry then deliberately pushed past it: "overtraining" a small model on far more tokens than is compute-optimal costs more up front but yields a model that is cheaper to run for as long as it is deployed, trading a one-time training cost for lasting inference savings.
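The allocation arithmetic can be sketched with two widely quoted rules of thumb, neither of which is stated in the text above and both of which are assumptions here: training compute is approximately 6 FLOPs per parameter per token, and the compute-optimal ratio is roughly 20 tokens per parameter. The function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rough compute-optimal ratio


def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def split_budget(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) at a given ratio."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    return n_params, tokens_per_param * n_params


# Budget of a 70B-parameter model trained on 1.4T tokens (Chinchilla's size).
budget = training_flops(70e9, 1.4e12)

n_opt, d_opt = split_budget(budget)
print(f"compute-optimal: {n_opt:.3g} params, {d_opt:.3g} tokens")

# Overtraining: same budget, ten times the tokens-per-parameter ratio.
n_small, d_big = split_budget(budget, tokens_per_param=200.0)
print(f"overtrained:     {n_small:.3g} params, {d_big:.3g} tokens")
```

The overtrained split spends the same budget on a model roughly a third of the size, which is the trade the paragraph describes: a smaller model to serve, paid for with more training tokens.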

Limits — and why home users benefit

Scaling laws describe loss on the training objective, not usefulness on real tasks. Downstream capability can improve unevenly and sometimes appears to jump abruptly, the discontinuities often discussed as emergent abilities. The laws are empirical observations, not physics: they can bend as high-quality data runs short, as architectures change, or as post-training methods (instruction tuning, reinforcement learning, distillation) add capability that pretraining curves never measured. Compute spent at inference time, letting a model reason longer, has emerged as a complementary scaling axis alongside training compute.

For the sovereign builder, the overtraining era is quietly the best news in the story. The same economics that pushes labs to overtrain small models also produced capable open-weight models that fit on consumer hardware: a heavily trained mid-size model, squeezed further with quantization, now delivers what frontier systems delivered only a few years earlier, running entirely on a machine you own.
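A back-of-the-envelope sketch shows why quantization matters for fitting a model on a machine you own. It counts weight storage only, as parameters times bits per weight; it ignores the KV cache, activations, and the per-block metadata that real quantized formats carry. The function names and the `headroom` fraction are hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_memory(n_params, bits_per_weight, memory_gb, headroom=0.8):
    """True if the weights fit within a fraction of available memory.

    headroom is an assumed safety margin left for everything that is
    not weights (cache, activations, runtime overhead).
    """
    return weight_memory_gb(n_params, bits_per_weight) <= memory_gb * headroom


N_PARAMS = 7e9  # illustrative mid-size model

for bits in (16, 8, 4):
    size = weight_memory_gb(N_PARAMS, bits)
    ok = fits_in_memory(N_PARAMS, bits, memory_gb=8)
    print(f"{bits:>2}-bit weights: {size:5.1f} GB, fits in 8 GB: {ok}")
```

Treat the result as a lower bound on memory use; long contexts in particular can add several gigabytes on top of the weights.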

Reading the curves like an operator

There is a familiar rhyme here for miners: scaling laws played the role in AI that Moore's Law played in silicon — a predictable improvement curve that let capital plan years ahead, followed by a maturing phase where the easy gains thin out and cleverness moves elsewhere. The operator's takeaway from both stories is the same: never anchor decisions to the curve's past slope. For a home-lab builder, that means sizing hardware for the models that exist rather than the ones extrapolation promises, and expecting the frontier of usefulness-per-watt to keep shifting toward smaller, better-trained, more aggressively compressed models. The curves reward whoever measures — fit your own little scaling law across a few model sizes on your actual task, and let that local evidence, not headlines, pick what you run.
