Definition
The Zero Redundancy Optimizer (ZeRO) is a family of memory optimizations in Microsoft's DeepSpeed library. Ordinary data parallelism wastes memory by storing identical copies of the optimizer states, gradients, and parameters on every device. ZeRO removes that redundancy by partitioning these training states across the available devices, so each holds only a fraction, while preserving the simplicity and compute efficiency of data parallelism.
The three stages
ZeRO is applied incrementally. Stage 1 partitions the optimizer states (such as Adam's momentum and variance), which are typically the largest share of model-state memory in mixed-precision training, with almost no change to the communication pattern. Stage 2 additionally partitions the gradients, so each device keeps only the gradients matching its optimizer shard. Stage 3 goes furthest and partitions the model parameters themselves, gathering them automatically when a layer needs them in the forward and backward passes and releasing them afterward. With all three states partitioned, per-device memory for model states shrinks in proportion to the number of devices, which is what makes models with hundreds of billions of parameters trainable.
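The effect of each stage can be made concrete with a rough calculation. The sketch below assumes the commonly cited accounting for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for optimizer states (fp32 master weights, momentum, and variance). The function name zero_memory_bytes and the example sizes are illustrative, and the estimate ignores activations, temporary buffers, and fragmentation.

```python
def zero_memory_bytes(num_params, num_devices, stage, optimizer_bytes_per_param=12):
    """Approximate per-device bytes for model states under mixed-precision Adam.

    Assumed accounting (not measured): 2 B/param fp16 weights,
    2 B/param fp16 gradients, 12 B/param optimizer states.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params
    grads = 2 * num_params
    optim = optimizer_bytes_per_param * num_params
    if stage >= 1:          # Stage 1: shard optimizer states
        optim /= num_devices
    if stage >= 2:          # Stage 2: also shard gradients
        grads /= num_devices
    if stage >= 3:          # Stage 3: also shard parameters
        params /= num_devices
    return params + grads + optim


# Example: a 7.5-billion-parameter model on 64 devices (illustrative sizes).
for stage in range(4):
    gb = zero_memory_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, Stage 1 already removes most of the footprint because the optimizer states dominate, while only Stage 3 keeps shrinking toward zero as more devices are added.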
Picking a stage
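Higher stages save more memory, but the savings are not free. Stages 1 and 2 keep roughly the same communication volume as ordinary data parallelism, while Stage 3 moves about 50% more data across the interconnect because parameters must be gathered during both the forward and backward passes. A common practice is to start at the lowest stage that fits the model and escalate only when memory still overflows, balancing savings against communication overhead. For the most extreme cases, ZeRO can also offload partitioned states to CPU memory or NVMe storage.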
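In DeepSpeed the stage is chosen in the training configuration, so escalating usually means changing one value. The sketch below builds such a configuration as a Python dict. The helper zero_config is hypothetical, the batch size and bf16 setting are placeholders, and the key names follow DeepSpeed's JSON configuration schema as commonly documented, so they should be checked against the installed version.

```python
import json


def zero_config(stage, offload_optimizer_to_cpu=False):
    """Build a minimal DeepSpeed-style config dict for a given ZeRO stage.

    Hypothetical helper for illustration; values are placeholders.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    if offload_optimizer_to_cpu and stage == 0:
        raise ValueError("optimizer offload requires ZeRO stage 1 or higher")

    zero = {"stage": stage}
    if offload_optimizer_to_cpu:
        # Keep the partitioned optimizer states in host memory.
        zero["offload_optimizer"] = {"device": "cpu"}

    return {
        "train_micro_batch_size_per_gpu": 1,  # placeholder value
        "bf16": {"enabled": True},            # assumed precision setting
        "zero_optimization": zero,
    }


# Start low; raise `stage` only if the model still runs out of memory.
config = zero_config(stage=2, offload_optimizer_to_cpu=True)
print(json.dumps(config, indent=2))
```

The resulting dict can be written to a JSON file or handed to the training setup directly. Keeping the stage as a single parameter makes it easy to try Stage 1 first and move up only when needed.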
ZeRO is the conceptual ancestor of PyTorch's Fully Sharded Data Parallel (FSDP) and an extension of Data Parallelism. Combine it with Gradient Checkpointing to push memory limits further.
