Definition
ZeRO-Offload is a training technique that relocates the heaviest memory consumers, the optimizer states and gradients, from scarce GPU memory to abundant host CPU memory. For large transformer models trained with an adaptive optimizer, optimizer states and gradients can account for over 85% of the memory taken by model states (weights, gradients, and optimizer state), so moving them off the GPU dramatically lowers the GPU memory needed to train a given model.
Why optimizer states dominate
An adaptive optimizer such as Adam keeps several FP32 values per parameter: its momentum and variance estimates, plus a master copy of the weights. In mixed-precision training these three 4-byte values require roughly twelve bytes per parameter just for the optimizer, dwarfing the two bytes per parameter of the FP16 weights themselves. By partitioning and offloading them to CPU RAM, ZeRO-Offload reportedly lets models with billions of parameters train on a single GPU that could otherwise never hold them; the original paper reports over 13 billion parameters on one GPU.
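The byte accounting above can be checked with a few lines of arithmetic. The sketch below assumes mixed-precision Adam with FP16 weights and gradients and three FP32 optimizer values per parameter; the function names are hypothetical, and activations and temporary buffers are deliberately left out, so the real share of total memory will be lower.

```python
def model_state_bytes(num_params: int) -> dict:
    """Bytes used by model states under mixed-precision Adam (assumed layout)."""
    return {
        "fp16_weights": 2 * num_params,   # working copy used in forward/backward
        "fp16_grads": 2 * num_params,     # gradients produced by backward
        # FP32 master weights + momentum + variance: 4 + 4 + 4 bytes
        "optimizer": 12 * num_params,
    }


def offloadable_fraction(num_params: int) -> float:
    """Share of model-state memory held by gradients and optimizer state."""
    states = model_state_bytes(num_params)
    total = sum(states.values())
    return (states["fp16_grads"] + states["optimizer"]) / total


if __name__ == "__main__":
    n = 1_000_000_000  # a 1-billion-parameter model
    states = model_state_bytes(n)
    for name, size in states.items():
        print(f"{name:>12}: {size / 1e9:.0f} GB")
    print(f"offloadable share: {offloadable_fraction(n):.1%}")
```

Under these assumptions, 14 of every 16 bytes of model state belong to the gradients and optimizer, which is consistent with the "over 85%" figure.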
The cost of offloading
Nothing is free: gradients must be copied to the CPU over the relatively slow PCIe link, the optimizer step runs on the CPU, and the updated values must be copied back over the same link. To stop the optimizer from becoming the bottleneck, the technique pairs with a highly optimised CPU implementation of the Adam optimizer and overlaps transfers with GPU compute. The result trades some throughput for the ability to train models far larger than the GPU alone could fit.
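The data flow of one offloaded step can be sketched in plain Python. This is a toy illustration, not the actual implementation: the "GPU" and "CPU" are ordinary lists, the copies stand in for PCIe transfers, Python floats stand in for FP32, and the names CpuAdam, train_step, and to_fp16 are hypothetical. It also runs the stages one after another, whereas the real technique overlaps transfers with GPU compute.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


class CpuAdam:
    """Toy Adam whose master weights and moments live on the 'CPU' side."""

    def __init__(self, weights, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.master = list(weights)          # higher-precision master copy
        self.m = [0.0] * len(weights)        # first-moment estimates
        self.v = [0.0] * len(weights)        # second-moment estimates
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0

    def step(self, grads):
        self.t += 1
        bc1 = 1 - self.beta1 ** self.t       # bias corrections
        bc2 = 1 - self.beta2 ** self.t
        for i, g in enumerate(grads):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / bc1
            v_hat = self.v[i] / bc2
            self.master[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
        return self.master


def train_step(gpu_weights, grad_fn, cpu_opt):
    """One step: backward on 'GPU', Adam on 'CPU', FP16 weights back to 'GPU'."""
    gpu_grads = [to_fp16(g) for g in grad_fn(gpu_weights)]  # backward pass
    cpu_grads = list(gpu_grads)                              # GPU -> CPU copy
    master = cpu_opt.step(cpu_grads)                         # optimizer on CPU
    return [to_fp16(w) for w in master]                      # CPU -> GPU copy


if __name__ == "__main__":
    # Hypothetical loss: sum((w - 1)^2), whose gradient is 2 * (w - 1).
    weights = [0.0, 2.0]
    opt = CpuAdam(weights, lr=0.1)
    weights = train_step(weights, lambda ws: [2.0 * (w - 1.0) for w in ws], opt)
    print(weights)
```

Only the FP16 weights ever return to the GPU side; the master copy and both moment estimates stay with the optimizer, which is where the memory saving comes from.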
For sovereign builders training on a single workstation rather than a rented cluster, CPU offload is one of the most powerful levers available. It complements memory savers like activation recomputation and reduced-precision formats such as BF16.
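As an illustration of how these levers are typically switched on together, the fragment below shows a configuration in the style of DeepSpeed, the library that ships ZeRO-Offload. The text above does not name a library, so treating DeepSpeed as the target is an assumption, and the batch size is a placeholder value; check the keys against the documentation of the version in use.

```python
import json

# Assumed DeepSpeed-style configuration; values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},            # reduced-precision training format
    "zero_optimization": {
        "stage": 2,                        # partition optimizer state and gradients
        "offload_optimizer": {
            "device": "cpu",               # keep optimizer state in host RAM
            "pin_memory": True,            # page-locked buffers for faster copies
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(ds_config, indent=2))
```

Activation recomputation is usually enabled separately in the model code rather than in this dictionary.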
