Loss Scaling

Loss scaling is a numerical technique that keeps low-precision training stable by preventing small gradients from vanishing to zero. In formats like FP16, the smallest representable normal value is around 6×10^-5 (with subnormals extending to roughly 6×10^-8), yet many gradients in deep networks fall at or below that floor. Values inside the subnormal range survive only with reduced precision, and anything below it rounds to zero, a failure mode called underflow. A parameter that receives a zero gradient never learns. Left uncorrected, underflow makes mixed-precision training silently diverge from what full-precision training would have produced, which is exactly the kind of bug that wastes a week of GPU time before anyone notices.

How it works

The fix is almost embarrassingly simple. After the forward pass and before backpropagation, the loss value is multiplied by a scale factor, say 1024 or 65536. By the chain rule, every gradient produced during backprop is then scaled by that same factor, shifting the entire distribution of tiny values up into the format's representable range. Once gradients are computed, they are divided by the same factor before being applied to the FP32 master weights; the division is done in FP32, so the unscaled values do not underflow a second time. The arithmetic is mathematically equivalent to the unscaled version, just numerically survivable. The scale factor must thread a needle: large enough to rescue the smallest gradients from underflow, small enough that the largest gradients do not overflow to infinity at the top of the format's range.
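The scale-then-unscale round trip can be illustrated with a minimal sketch using only Python's standard library. The helper `to_fp16` is a hypothetical name introduced here; it relies on the `struct` module's half-precision `"e"` format to round a value the way FP16 storage would. The gradient value and scale factor are illustrative assumptions, not numbers from any real training run.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8   # illustrative gradient, below FP16's smallest subnormal
scale = 65536.0    # illustrative scale factor (2**16)

# Without scaling, the gradient underflows to exactly zero in FP16.
unscaled = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# so the FP16 backward pass is modelled as storing true_grad * scale.
scaled_fp16 = to_fp16(true_grad * scale)

# Unscale in higher precision (Python floats are 64-bit) before the update.
recovered = scaled_fp16 / scale

print("unscaled FP16 gradient:", unscaled)
print("recovered gradient:    ", recovered)
```

The unscaled path loses the gradient entirely, while the scaled path recovers it to within FP16's rounding error. Note that `struct.pack` raises `OverflowError` for values beyond FP16's range, which is the other edge of the needle the scale factor has to thread.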

Static vs dynamic scaling

A fixed scale factor can be chosen by hand, but the right value varies by model, by dataset, and even by phase of training. Modern frameworks therefore use dynamic loss scaling: the factor is raised (typically doubled) whenever a set number of steps passes without an overflow, and immediately cut (typically halved) when an overflow is detected — the offending step is simply skipped rather than applied. This self-tuning behaviour finds the largest safe scale automatically and recovers gracefully when gradients spike, at the cost of an occasional discarded step. In practice the mechanism is invisible: PyTorch's gradient scaler and its equivalents in other frameworks wrap the whole dance in a few lines of code.
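The grow-and-back-off rule described above can be sketched as a small class. This is a simplified illustration, not any framework's actual implementation: `DynamicLossScaler`, `has_overflow`, and all parameter names and values are assumptions made for the example.

```python
import math


def has_overflow(grads):
    """Return True if any gradient is inf or NaN (hypothetical helper)."""
    return any(not math.isfinite(g) for g in grads)


class DynamicLossScaler:
    """Hypothetical minimal dynamic loss scaler, for illustration only."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0  # consecutive steps without overflow

    def update(self, found_overflow: bool) -> bool:
        """Adjust the scale; return True if the step should be applied."""
        if found_overflow:
            # Cut the scale immediately and skip this optimizer step.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # Enough clean steps in a row: try a larger scale.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


# Usage: three clean steps, one overflow, one clean step.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
fake_grads = [[0.1], [0.2], [0.3], [float("inf")], [0.1]]  # made-up data
history = []
for grads in fake_grads:
    applied = scaler.update(has_overflow(grads))
    history.append((applied, scaler.scale))
```

In this run the scale doubles to 2048 after the third clean step, is halved back to 1024 when the overflow appears (and that step is skipped), then holds steady. A short `growth_interval` is used only to keep the demonstration small; real training would typically wait far longer between growth attempts.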

Where it fits — and where it is fading

Loss scaling is most associated with FP16 mixed-precision training, where the narrow exponent makes underflow a constant threat. BF16 largely dissolves the problem by spending its bits on an FP32-sized exponent range: most BF16 training runs need no loss scaling at all, which is a big part of why the format took over. At the other extreme, FP8 resurrects and generalises the idea. With almost no representable range to spare, FP8 training uses automatic per-tensor scaling factors, which apply loss scaling's logic individually to every tensor rather than once globally. Loss scaling also pairs naturally with gradient clipping, which guards against the opposite failure of gradients growing too large. For the self-hosted builder doing fine-tuning on consumer GPUs, the takeaway is practical: mixed precision is what makes training fit in limited VRAM, and loss scaling (or a format that makes it unnecessary) is the small piece of machinery that keeps the cheap arithmetic honest.
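The per-tensor idea can be sketched in a few lines: each tensor gets its own factor, chosen so its largest magnitude lands at the top of the format's range. The function `per_tensor_scale` and the sample values are hypothetical. The constant 448 is the largest finite value of the FP8 E4M3 format. Production recipes are more elaborate (for example, some track a history of maxima rather than using only the current one), so treat this as a simplified assumption.

```python
E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def per_tensor_scale(values, fmt_max=E4M3_MAX):
    """Return a scale mapping the tensor's largest magnitude onto fmt_max.

    Hypothetical helper; falls back to 1.0 for an all-zero tensor.
    """
    amax = max(abs(v) for v in values)
    return fmt_max / amax if amax > 0 else 1.0


# Two made-up tensors with very different magnitudes.
activations = [0.5, -2.0, 1.25]
tiny_grads = [1e-6, -4e-6, 2e-6]

s_act = per_tensor_scale(activations)   # 448 / 2.0 = 224.0
s_grad = per_tensor_scale(tiny_grads)   # about 1.12e8

# After scaling, both tensors occupy the same range, up to fmt_max.
scaled_act = [v * s_act for v in activations]
scaled_grad = [v * s_grad for v in tiny_grads]
```

A single global factor could not serve both tensors here: a scale large enough to lift the gradients would push the activations far past the format's maximum. Each scale has to be stored alongside its tensor so the values can be divided back out later.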

The technique also earns a place in your debugging vocabulary. When a mixed-precision training or fine-tuning run shows loss frozen at a plateau, gradients reported as zero, or a scaler whose value collapses steadily downward, you are looking at the underflow-overflow boundary being fought in real time — and the framework's scaler statistics are the first log worth reading. Frequent overflow-skipped steps early in training are normal; persistent ones signal a learning rate or data problem that scaling cannot paper over. Knowing that distinction saves GPU-days, which on owned hardware is the resource that matters most.
