Optimizer State


Optimizer state is the per-parameter bookkeeping an adaptive optimizer keeps between update steps. The Adam optimizer and its variants, the workhorses of modern deep learning, maintain two such values for every weight: a running average of past gradients (the first moment, or momentum) and a running average of squared gradients (the second moment, or variance). These let the optimizer adapt the effective step size for each parameter individually, which is a large part of why Adam-family optimizers train stubborn networks that plain stochastic gradient descent struggles with. The price is memory: those two extra numbers exist for every single trainable weight, and they live on the accelerator for the entire run.
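The two running averages can be made concrete with a minimal sketch of one Adam update for a single scalar weight. The function name `adam_step`, the toy objective, and the hyperparameter values are illustrative assumptions, not taken from any particular library; real implementations apply the same arithmetic to whole tensors.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight (illustrative sketch).

    m and v are the optimizer state: they must persist between calls.
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * g        # first moment: running average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g    # second moment: running average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero-initialised averages
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

# Toy objective f(w) = w^2, whose gradient is 2w (an assumption for illustration).
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    g = 2.0 * w
    w, m, v = adam_step(w, g, m, v, t, lr=0.05)
print(w, m, v)
```

Note that `m` and `v` are returned and fed back in on every call. That pair, repeated for every trainable weight, is the optimizer state this entry describes.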

A hidden memory tax

Because Adam stores two states per parameter, and mixed-precision training usually adds an FP32 master copy of the weights, the optimizer's memory footprint can dwarf the model itself. In mixed-precision training the FP32 parameter copy, momentum, and variance each cost four bytes per parameter: roughly twelve bytes of optimizer overhead for every weight. For a billion-parameter model that is on the order of twelve gigabytes before activations, gradients, or the model's own low-precision weights are even counted. Scale that to a seven-billion-parameter model and the optimizer alone wants roughly 84 GB, more memory than consumer GPUs physically carry. This is the arithmetic that surprises people who assume a model that runs on their card can also be trained on it: inference needs the weights, but full training needs the weights, the gradients, and this hidden ledger on top.
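The twelve-bytes-per-parameter figure is simple enough to check by hand, but a short sketch makes the scaling explicit. The helper name `adam_overhead_bytes` is hypothetical, and the sketch assumes the mixed-precision layout described above (FP32 master copy plus two FP32 moments) with decimal gigabytes.

```python
BYTES_FP32 = 4  # bytes per 32-bit float

def adam_overhead_bytes(n_params):
    """Optimizer-side memory for mixed-precision Adam (illustrative sketch)."""
    master_copy = n_params * BYTES_FP32  # FP32 master weights
    momentum = n_params * BYTES_FP32     # first moment
    variance = n_params * BYTES_FP32     # second moment
    return master_copy + momentum + variance

GB = 1e9  # decimal gigabytes, assumed for readability
for label, n_params in [("1B", 1_000_000_000), ("7B", 7_000_000_000)]:
    print(f"{label} parameters: {adam_overhead_bytes(n_params) / GB:.0f} GB of optimizer overhead")
```

This counts only the optimizer's ledger. Gradients, activations, and the low-precision working weights come on top.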

Why it shapes infrastructure

Optimizer states and gradients together can exceed 85% of training memory, and entire families of techniques exist specifically to tame them. The ZeRO line of optimizations partitions these states across devices so each GPU holds only a shard; offloading pushes them to host RAM (ZeRO-Offload) or NVMe (ZeRO-Infinity), trading PCIe traffic for capacity. Memory-efficient optimizers attack the state itself: 8-bit optimizers quantize the moments, and some research optimizers compress or eliminate one of the two moments entirely. Sharding and offloading keep the same mathematical update and only relocate the ledger that supports it; quantized and reduced-state optimizers shrink the ledger by approximating that update.
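A rough sketch of how two of these levers change the per-GPU bill follows. It assumes an even split of state across ranks and, for the 8-bit case, one byte per moment with the small per-block quantization constants ignored. The function names are hypothetical and do not correspond to any library's API.

```python
def per_gpu_state_bytes(n_params, world_size, bytes_per_param=12):
    """Optimizer bytes held by each GPU when state is sharded evenly (sketch)."""
    shard_params = -(-n_params // world_size)  # ceiling division: last shard may be padded
    return shard_params * bytes_per_param

def eight_bit_bytes_per_param(master_bytes=4, moment_bytes=1):
    """Bytes per parameter with both moments quantized to 8 bits (assumed layout)."""
    return master_bytes + 2 * moment_bytes

n_params = 7_000_000_000
GB = 1e9
print(f"unsharded:          {per_gpu_state_bytes(n_params, 1) / GB:.1f} GB")
print(f"sharded over 8:     {per_gpu_state_bytes(n_params, 8) / GB:.1f} GB per GPU")
print(f"8-bit moments only: {n_params * eight_bit_bytes_per_param() / GB:.1f} GB")
```

The levers compose: sharding divides whatever the per-parameter cost is, and quantization lowers that cost before it is divided.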

The self-hoster's angle

For anyone training on their own hardware, optimizer state is the first budget line to check. It also explains why parameter-efficient fine-tuning is so attractive on a workbench-scale rig: with a LoRA adapter, only the small adapter matrices are trainable, so optimizer state exists only for them. The frozen base model carries no moments at all, which collapses the twelve-bytes-per-parameter tax to a rounding error and is a major reason a single 24 GB card can fine-tune models it could never fully train. The choice of LoRA rank therefore sets not just adapter capacity but optimizer overhead too.
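To see how rank drives optimizer overhead, here is a small sketch that counts LoRA's trainable parameters. A rank-r adapter on a d_out x d_in weight matrix adds two small matrices with r * (d_in + d_out) parameters in total. The model shape used below (32 layers, four adapted 4096 x 4096 matrices per layer) is a hypothetical example, not a specific model, and the twelve-byte figure carries over from the mixed-precision assumption above.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_optimizer_bytes(n_layers, matrices_per_layer, d_in, d_out, rank, bytes_per_param=12):
    """Optimizer overhead for all adapters in a model (illustrative sketch)."""
    per_adapter = lora_trainable_params(d_in, d_out, rank)
    return n_layers * matrices_per_layer * per_adapter * bytes_per_param

# Hypothetical model shape, chosen only for illustration.
for rank in (8, 16, 64):
    size = lora_optimizer_bytes(n_layers=32, matrices_per_layer=4,
                                d_in=4096, d_out=4096, rank=rank)
    print(f"rank {rank:3d}: {size / 1e6:.0f} MB of optimizer overhead")
```

Overhead grows linearly with rank, so doubling the rank doubles the optimizer line while the frozen base model still contributes nothing to it.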

Checkpoints and resumption

Optimizer state is also why training checkpoints are so much larger than the models they produce. To resume a run exactly where it left off, you must save the moments alongside the weights; discard them and the optimizer restarts cold, often with a visible loss spike while the running averages warm back up. A distributed checkpoint that shards optimizer state across ranks must be reassembled or re-sharded if you resume on a different device count, a routine but easy-to-fumble step in home-lab training. Understanding optimizer state is key to planning self-hosted training: during each step it works alongside the FP32 master weights and stabilising tricks like gradient clipping, and together with the master weights and the gradients it makes up most of the real memory bill of teaching a model anything new.
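The re-sharding step can be pictured with a toy sketch. It assumes the state was flattened and split into contiguous, rank-ordered chunks, which is a simplification: real frameworks use their own checkpoint layouts, padding rules, and conversion tools, so treat the function `reshard` as a hypothetical illustration of the idea only.

```python
def reshard(shards, new_world_size):
    """Reassemble rank-ordered shards and split them for a new device count (toy sketch)."""
    flat = [value for shard in shards for value in shard]  # reassemble in rank order
    shard_size = -(-len(flat) // new_world_size)           # ceiling division
    return [flat[i * shard_size:(i + 1) * shard_size] for i in range(new_world_size)]

# Ten moment values saved from a 4-GPU run (placeholder numbers).
saved = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
resumed = reshard(saved, new_world_size=2)
print(resumed)
```

The invariant worth checking after any such conversion is that concatenating the new shards reproduces the original flat state, in the same order.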

A useful habit when sizing any training job is to write the budget out explicitly before touching a GPU: weights, gradients, optimizer state, activations, each in bytes per parameter, each multiplied out against your card's actual capacity. The optimizer line is the one newcomers forget and the one that most often decides between "trains comfortably" and "crashes at step one." Once you can read that ledger fluently, every technique in the memory-efficiency literature stops being magic and becomes a line-item negotiation: shard this, quantize that, offload the rest. That is the same discipline a miner applies to a power budget, and it pays off in exactly the same way: hardware you already own doing work you were told it could not.
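Written as code, that ledger might look like the sketch below. The per-parameter byte counts assume mixed-precision Adam with 16-bit weights and gradients, and the activation figure is a placeholder you would measure or estimate for your own batch size and sequence length. All names here are hypothetical.

```python
# Assumed bytes per parameter for mixed-precision Adam training.
BYTES_PER_PARAM = {
    "weights (16-bit)": 2,
    "gradients (16-bit)": 2,
    "optimizer state (FP32 master + 2 moments)": 12,
}

def training_budget(n_params, activation_bytes, capacity_bytes):
    """Return the itemised ledger, its total, and whether it fits on the card."""
    ledger = {name: n_params * b for name, b in BYTES_PER_PARAM.items()}
    ledger["activations"] = activation_bytes  # workload-dependent, supplied by the caller
    total = sum(ledger.values())
    return ledger, total, total <= capacity_bytes

GB = 10**9
# Placeholder inputs: 1B parameters, 4 GB of activations, a 24 GB card.
ledger, total, fits = training_budget(1_000_000_000, 4 * GB, 24 * GB)
for name, size in ledger.items():
    print(f"{name:45s} {size / GB:6.1f} GB")
print(f"{'total':45s} {total / GB:6.1f} GB  ->  {'fits' if fits else 'does not fit'}")
```

Swapping in a different parameter count or card size shows immediately which line item has to be sharded, quantized, or offloaded.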
