ZeRO-Offload / CPU Offload


ZeRO-Offload is a training technique that relocates the heaviest memory consumers, the optimizer states and gradients, from scarce GPU memory to abundant host CPU memory. For large transformer models trained in mixed precision with an Adam-style optimizer, optimizer states and gradients can account for over 85 percent of the memory taken by model states (weights, gradients, and optimizer state), so moving them off the GPU dramatically lowers the card requirements for training a given model. It is one of the defining techniques of the "train big on small hardware" toolbox, and for a sovereign builder with one workstation rather than a rented cluster, it can be the difference between a model that fits and one that doesn't.

Why optimizer states dominate

An adaptive optimizer keeps several FP32 values per parameter — the momentum and variance terms of the optimizer state, plus a master copy of the weights. In mixed-precision training these can require roughly twelve bytes per parameter just for the optimizer, dwarfing the two bytes the model's own BF16 weights occupy. The insight behind offloading is that this bulk is touched only once per step, during the optimizer update — it does not need to sit on the GPU while forward and backward passes run. By partitioning it and parking it in CPU RAM, ZeRO-Offload reportedly lets models with billions of parameters train on a single GPU that could otherwise never hold them.
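The byte counts above can be turned into a quick sizing check. The sketch below is illustrative only: it assumes BF16 weights and gradients at two bytes each and three FP32 optimizer values at four bytes each, ignores activations and framework overhead, and uses a hypothetical 7-billion-parameter model as the example. The names `BYTES_PER_PARAM`, `OFFLOADED`, and `model_state_gb` are invented for this sketch.

```python
# Per-parameter byte counts for mixed-precision training with an Adam-style optimizer.
# Assumptions: BF16 weights and gradients (2 bytes each), FP32 optimizer values (4 bytes each).
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_momentum": 4,
    "fp32_variance": 4,
}

# The pieces this article describes as moving to CPU RAM: gradients plus optimizer state.
OFFLOADED = ("bf16_gradients", "fp32_master_weights", "fp32_momentum", "fp32_variance")


def model_state_gb(num_params: int) -> dict:
    """Return model-state memory in GB (10**9 bytes), split by where it lives under offload."""
    total = sum(BYTES_PER_PARAM.values()) * num_params
    on_cpu = sum(BYTES_PER_PARAM[key] for key in OFFLOADED) * num_params
    return {
        "total_gb": total / 1e9,
        "cpu_gb": on_cpu / 1e9,
        "gpu_gb": (total - on_cpu) / 1e9,
        "offloaded_fraction": on_cpu / total,
    }


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model, chosen only for illustration.
    for name, value in model_state_gb(7_000_000_000).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, 14 of the 16 bytes per parameter are candidates for offloading, which is where the "over 85 percent" figure comes from. Activation memory is not counted here and still has to fit on the GPU.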

The cost of offloading

Nothing is free: gradients must be copied to the CPU, the optimizer step runs on the CPU, and updated values are copied back, all over the relatively slow PCIe link. To stop the optimizer from becoming the bottleneck, the technique pairs with a highly optimized CPU implementation of the Adam optimizer and overlaps transfers with GPU compute, so the card keeps working while data moves. The result trades some throughput — often a modest fraction, sometimes more on bandwidth-starved systems — for the ability to train models far larger than the GPU alone could fit. System RAM and PCIe generation become first-class specs: a workstation with 128 GB of RAM and a modern link offloads far more gracefully than a gaming box with 32 GB.
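To get a feel for why PCIe generation matters, the round trip can be estimated with back-of-envelope arithmetic. This is a rough sketch, not a benchmark: the bandwidth figures are nominal x16 peaks rounded to convenient values, the `efficiency` factor of 0.8 is an assumption, and the model treats overlap as ideal (only transfer time exceeding GPU compute time is visible). The function names are hypothetical.

```python
# Nominal peak one-direction bandwidth of a x16 link, in GB/s (assumed round figures).
PCIE_X16_GB_PER_S = {"gen3": 16.0, "gen4": 32.0, "gen5": 64.0}


def transfer_seconds_per_step(num_params, link="gen4", efficiency=0.8,
                              grad_bytes=2, weight_bytes=2):
    """Estimate time to copy gradients to the CPU and updated weights back to the GPU.

    efficiency is an assumed fraction of peak bandwidth that real transfers achieve.
    """
    usable_bytes_per_s = PCIE_X16_GB_PER_S[link] * 1e9 * efficiency
    to_cpu = num_params * grad_bytes / usable_bytes_per_s    # gradients, GPU -> CPU
    to_gpu = num_params * weight_bytes / usable_bytes_per_s  # updated weights, CPU -> GPU
    return to_cpu + to_gpu


def exposed_seconds(transfer_s, gpu_compute_s):
    """Idealised overlap: only the transfer time that exceeds GPU compute is visible."""
    return max(0.0, transfer_s - gpu_compute_s)


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical model size
    for link in PCIE_X16_GB_PER_S:
        seconds = transfer_seconds_per_step(params, link)
        print(f"{link}: {seconds:.2f} s of transfer per optimizer step")
```

With these assumptions, a 7-billion-parameter model needs a little over one second of transfer per step on a gen4 link and about twice that on gen3. Whether that cost is hidden depends on how long the forward and backward passes take on the GPU.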

The wider ZeRO family

Offload is one member of the Zero Redundancy Optimizer (ZeRO) family, which partitions optimizer states, gradients, and eventually the parameters themselves across whatever memory pools exist — multiple GPUs in its multi-card stages, CPU RAM in Offload, and even NVMe storage in the later ZeRO-Infinity extension. The unifying idea is that no byte should be replicated where one copy, placed in the cheapest adequate tier, will do.
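The effect of the multi-card stages can be sketched with the same per-parameter accounting. The example below assumes the usual numbering (stage 1 partitions optimizer states, stage 2 adds gradients, stage 3 adds the parameters), 2 + 2 + 12 bytes per parameter, and even partitioning across GPUs. It ignores activations and communication buffers, and `per_gpu_model_state_gb` is a hypothetical helper rather than any library's API.

```python
def per_gpu_model_state_gb(num_params, num_gpus, stage):
    """Estimate model-state memory per GPU, in GB (10**9 bytes), for ZeRO stages 0 to 3.

    Stage 0 means plain data parallelism, where every GPU holds a full replica.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights, grads, optim = 2.0, 2.0, 12.0  # assumed bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # optimizer states partitioned
    if stage >= 2:
        grads /= num_gpus    # gradients partitioned as well
    if stage >= 3:
        weights /= num_gpus  # parameters partitioned as well
    return (weights + grads + optim) * num_params / 1e9


if __name__ == "__main__":
    # Hypothetical setup: 7 billion parameters on 8 GPUs.
    for stage in range(4):
        gb = per_gpu_model_state_gb(7_000_000_000, 8, stage)
        print(f"stage {stage}: {gb:.2f} GB per GPU")
```

Each stage removes one more replicated component, so the per-GPU footprint falls from the full 16 bytes per parameter toward 16 / N bytes as more of the model state is partitioned.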

When to offload — and when to sidestep

Offloading earns its complexity in full-parameter training, where the optimizer bulk is unavoidable. For many practical goals on a single machine, the cheaper path is to shrink the trainable set instead: a LoRA adapter trains a tiny fraction of the parameters, so the optimizer state that offloading exists to relocate barely exists in the first place. A sensible decision order for limited VRAM: try parameter-efficient fine-tuning first, add gradient checkpointing to tame activations, and reach for CPU offload when the job genuinely requires updating every weight of a model bigger than your card.

A final planning note: offload changes what "enough hardware" means when you spec a machine. For an offload-heavy workload, system RAM capacity and PCIe bandwidth belong on the same line of the budget as the GPU itself, and doubling RAM is often the cheapest capability upgrade available, far cheaper per gigabyte than VRAM. That inversion, where the humble parts of the box become the enablers of the expensive one, is the practical legacy of the ZeRO line of work.
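The claim that a LoRA adapter leaves almost no optimizer state to relocate can be checked with simple counting. In the sketch below, a rank-r adapter on a weight matrix of shape d_out by d_in adds two small matrices with r * (d_in + d_out) trainable parameters in total. The layer shapes, the rank of 8, and the figure of twelve optimizer bytes per trainable parameter are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one rank-`rank` LoRA adapter on a d_out x d_in weight."""
    return rank * (d_in + d_out)


def optimizer_state_gb(trainable_params, bytes_per_param=12):
    """FP32 optimizer state (master weights, momentum, variance) in GB (10**9 bytes)."""
    return trainable_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Hypothetical model: 32 layers, each with four 4096 x 4096 projections adapted.
    layers, projections, dim, rank = 32, 4, 4096, 8
    full = layers * projections * dim * dim
    adapted = layers * projections * lora_params(dim, dim, rank)
    print(f"full fine-tuning: {full:,} trainable, "
          f"{optimizer_state_gb(full):.2f} GB optimizer state")
    print(f"LoRA rank {rank}:      {adapted:,} trainable, "
          f"{optimizer_state_gb(adapted):.2f} GB optimizer state")
    print(f"trainable fraction: {adapted / full:.4%}")
```

In this hypothetical setup the adapter trains well under one percent of the adapted weights, so its optimizer state fits comfortably on the GPU and there is nothing worth offloading.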
