Definition
Optimizer state is the per-parameter bookkeeping an adaptive optimizer keeps between update steps. The Adam optimizer and its variants, the workhorses of modern deep learning, maintain two such values for every weight: a running average of past gradients (the first moment, or momentum) and a running average of squared gradients (the second moment, loosely called the variance, although it is not mean-centered). Together these let the optimizer adapt the effective step size for each parameter individually.
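To make the two moments concrete, here is a minimal sketch of a single Adam update in plain Python. The names `init_state` and `adam_step` and the dictionary layout are illustrative assumptions, not any library's API; the hyperparameter defaults are the commonly used ones.

```python
import math

def init_state(num_params):
    # Hypothetical layout: one momentum and one variance slot per parameter,
    # plus a shared step counter used for bias correction.
    return {"step": 0, "m": [0.0] * num_params, "v": [0.0] * num_params}

def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place, reading and writing the optimizer state."""
    state["step"] += 1
    t = state["step"]
    for i, g in enumerate(grads):
        # First moment: running average of gradients.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        # Second moment: running average of squared gradients.
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialisation of m and v.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Per-parameter step: scaled by the gradient's own recent magnitude.
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

params = [1.0, -2.0]
grads = [0.5, -0.1]
state = init_state(len(params))
adam_step(params, grads, state)
```

Note that `state["m"]` and `state["v"]` are each as large as `params`, and both must survive from one call to the next. That persistence is what makes them optimizer state rather than temporary scratch space.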
A hidden memory tax
Because Adam stores two states per parameter, often alongside an FP32 master copy of the weights, its memory footprint can dwarf the model itself. In mixed-precision training, the FP32 parameter copy, momentum, and variance each cost four bytes per parameter, for a total of twelve bytes of optimizer overhead per weight. For a billion-parameter model that is on the order of twelve gigabytes before activations or the model's own low-precision weights are even counted.
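The arithmetic above can be checked with a short sketch. The function name `optimizer_overhead_bytes` is hypothetical, and the four-byte sizes are the FP32 assumption stated in the text.

```python
def optimizer_overhead_bytes(num_params, master_bytes=4, momentum_bytes=4, variance_bytes=4):
    """Bytes of Adam-related overhead, assuming FP32 for all three tensors."""
    return num_params * (master_bytes + momentum_bytes + variance_bytes)

num_params = 1_000_000_000  # a billion-parameter model
overhead = optimizer_overhead_bytes(num_params)
overhead_gb = overhead / 1e9  # decimal gigabytes
print(f"{overhead_gb:.1f} GB of optimizer overhead")
```

For one billion parameters this comes to 12.0 GB. Setting `master_bytes=0` models a pure-FP32 run with no separate master copy, which leaves eight bytes per parameter for the two moments.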
Why it shapes infrastructure
This overhead is why optimizer states and gradients together can exceed 85% of training memory, and why techniques exist specifically to tame them. Partitioning these states across devices, or offloading them to host memory (ZeRO-Offload and other forms of CPU offload), directly attacks the largest line item in the training memory budget. Research into memory-efficient optimizers targets the same state, compressing or eliminating one of the two moments.
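As a rough illustration of why partitioning helps, the sketch below divides the twelve-bytes-per-parameter overhead evenly across devices. It assumes a perfectly even split and ignores communication buffers and padding; `per_device_state_bytes` is a hypothetical helper, not part of any framework.

```python
def per_device_state_bytes(num_params, num_devices, bytes_per_param=12):
    """Optimizer-state bytes held by each device under an even partition."""
    total = num_params * bytes_per_param
    # Ceiling division: the last shard may be slightly smaller than the rest.
    return -(-total // num_devices)

unsharded = per_device_state_bytes(1_000_000_000, num_devices=1)
sharded = per_device_state_bytes(1_000_000_000, num_devices=8)
print(unsharded / 1e9, sharded / 1e9)
```

Under these assumptions a billion-parameter model drops from 12 GB of optimizer state on a single device to 1.5 GB per device on eight. Offloading instead moves the same bytes to host memory, trading accelerator memory for transfer time.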
Understanding optimizer state is key to planning self-hosted training. In each update step, the state is used together with the FP32 master weights and with stabilising techniques such as gradient clipping.
