Definition
Loss scaling is a numerical technique that keeps low-precision training stable by preventing small gradients from vanishing to zero. In FP16, the smallest representable positive value is around 6×10⁻⁸, yet many gradients in deep networks fall well below that. Anything much smaller rounds to zero, a failure mode called underflow, and a parameter that keeps receiving a zero gradient never learns.
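The underflow described above can be reproduced with the Python standard library alone. This is a minimal sketch, not taken from any framework: the helper name to_fp16 is hypothetical, and it relies on the struct module's half-precision format code "e" to round a value to FP16 and back.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and return it as a float.

    Hypothetical helper for illustration; values too large for FP16 map to
    infinity, mimicking overflow.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

smallest_fp16 = 2.0 ** -24   # smallest positive FP16 value, about 5.96e-8
tiny_gradient = 1e-8         # an illustrative gradient below that threshold

print(to_fp16(smallest_fp16) == smallest_fp16)  # survives the round trip
print(to_fp16(tiny_gradient))                   # underflows to 0.0
```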
How it works
After the forward pass and before backpropagation, the loss value is multiplied by a scale factor. By the chain rule, every gradient produced during backprop is then scaled by that same factor, shifting the whole distribution of tiny values up into the format's representable range. Once gradients are computed, they are divided by the same factor before being applied to the FP32 master weights, so the weight update is mathematically equivalent to the unscaled one while the intermediate gradients survive FP16's limited range.
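The scale-then-unscale round trip can be sketched for a single gradient value. The example below is illustrative only: to_fp16, scaled_fp16_gradient, and unscale are hypothetical helpers, the scale of 1024 is an arbitrary choice, and ordinary Python floats (FP64) stand in for the FP32 master copy.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def scaled_fp16_gradient(true_grad: float, loss_scale: float) -> float:
    """Gradient as FP16 would store it when the loss was multiplied by loss_scale.

    By the chain rule, scaling the loss scales every gradient by the same factor.
    """
    return to_fp16(true_grad * loss_scale)

def unscale(stored_grad: float, loss_scale: float) -> float:
    """Divide the scale back out in higher precision before the weight update."""
    return stored_grad / loss_scale

true_grad = 1e-8      # illustrative gradient, too small for FP16
loss_scale = 1024.0   # arbitrary power-of-two scale, assumed for this sketch

unscaled_result = to_fp16(true_grad)
recovered = unscale(scaled_fp16_gradient(true_grad, loss_scale), loss_scale)

print(unscaled_result)  # 0.0: the gradient is lost without scaling
print(recovered)        # close to 1e-8, up to FP16 rounding error
```

A power of two is a convenient choice of scale because multiplying and dividing by it changes only the exponent and introduces no extra rounding of its own.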
Static vs dynamic scaling
A fixed scale factor can be chosen by hand, but modern frameworks use dynamic loss scaling: the factor is raised after a set number of consecutive steps pass without an overflow and immediately halved when an overflow is detected. This self-tuning behaviour finds the largest safe scale automatically and recovers gracefully when gradients spike.
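The raise-and-halve policy can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not any framework's implementation: the class DynamicLossScaler and its parameter names are hypothetical, the default values are illustrative, and skipping the weight update on an overflowing step is an assumption the text above does not spell out.

```python
import math

class DynamicLossScaler:
    """Hypothetical dynamic loss scaler: grow after a run of clean steps, halve on overflow."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor      # multiplier applied after a clean run
        self.backoff_factor = backoff_factor    # multiplier applied on overflow
        self.growth_interval = growth_interval  # clean steps required before growing
        self.clean_steps = 0

    def update(self, scaled_grads) -> bool:
        """Inspect this step's scaled gradients and adjust the scale.

        Returns True if the step is safe to apply, False if it overflowed
        (assumption: the caller then skips the weight update).
        """
        overflow = any(not math.isfinite(g) for g in scaled_grads)
        if overflow:
            self.scale *= self.backoff_factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= self.growth_factor
            self.clean_steps = 0
        return True

# Usage: three clean steps raise the scale, one overflow halves it again.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
for _ in range(3):
    scaler.update([0.5, -0.25])
scale_after_growth = scaler.scale
step_ok = scaler.update([float("inf"), 0.1])
scale_after_overflow = scaler.scale
print(scale_after_growth, step_ok, scale_after_overflow)
```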
Loss scaling is most associated with FP16 mixed-precision training. BF16 often avoids the need for it because it shares FP32's wider exponent range, while FP8 generalises the idea into automatic per-tensor scaling. It pairs naturally with gradient clipping, which guards the opposite failure of gradients growing too large.
