ZeRO (Zero Redundancy Optimizer)

The Zero Redundancy Optimizer (ZeRO) is a family of memory optimizations in Microsoft's DeepSpeed library that makes large-model training fit on hardware that plain data parallelism would overflow. Ordinary data parallelism wastes memory by storing identical copies of the optimizer states, gradients, and parameters on every device. ZeRO removes that redundancy by partitioning these training states across the available devices, so each holds only a fraction, while preserving the simplicity and compute efficiency of data parallelism: every device still processes its own batch of data through what behaves like a full model.

Why the redundancy is so expensive

In mixed-precision training, the model's parameters are only a small part of the memory bill. For each parameter, an Adam-style optimizer typically keeps an FP32 master copy plus momentum and variance terms, and training also needs a gradient per parameter. Under the usual accounting that comes to about 16 bytes of state per parameter against 2 bytes for the FP16 weights themselves, so the training state is roughly eight times the size of the model, and classic data parallelism replicates every byte of it on every GPU. Eight GPUs training a model this way hold eight identical copies of state that only needs to exist once. That replication is exactly the "redundancy" ZeRO eliminates.
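The per-parameter bill can be tallied in a few lines. This is a minimal sketch that assumes the common mixed-precision Adam accounting (FP16 weights and gradients, FP32 master weights, momentum, and variance); the names `BYTES_PER_PARAM`, `training_state_bytes`, and `replicated_bytes` are illustrative, and activations, buffers, and fragmentation are ignored.

```python
# Bytes of training state per parameter under mixed-precision Adam.
# Assumed accounting: FP16 weights and gradients, FP32 master weights,
# momentum, and variance. Activations and temporary buffers are excluded.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_momentum": 4,
    "fp32_variance": 4,
}


def training_state_bytes(num_params: int) -> int:
    """Bytes of model state held by one full replica."""
    return num_params * sum(BYTES_PER_PARAM.values())


def replicated_bytes(num_params: int, num_gpus: int) -> int:
    """Cluster-wide bytes when every GPU holds a full replica."""
    return num_gpus * training_state_bytes(num_params)


if __name__ == "__main__":
    per_param = sum(BYTES_PER_PARAM.values())
    print("bytes per parameter:", per_param)
    print("state / FP16 weights:", per_param / BYTES_PER_PARAM["fp16_weights"])
    print("8 replicas of a 1B model (bytes):", replicated_bytes(10**9, 8))
```

Under these assumptions only the first entry is the model itself; everything else is state that plain data parallelism copies to every device.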

The three stages

ZeRO is applied incrementally. Stage 1 partitions the optimizer states, the largest consumer in mixed-precision training; each device updates only its shard of the parameters, and the updated shards are then gathered so that every device again holds the full weights. Stage 2 additionally partitions the gradients, so each device keeps only the gradients matching its optimizer shard; gradients are combined with a reduce-scatter rather than an all-reduce, and total communication volume stays about the same as in plain data parallelism. Stage 3 goes furthest and partitions the model parameters themselves, automatically gathering each layer's weights just before they are needed in the forward and backward passes and releasing them afterward, which costs roughly 50% more communication than the lower stages. Stage 3's memory reduction scales linearly with the number of devices, which is what enabled training runs with hundreds of billions of parameters on clusters of commodity-memory GPUs.
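The difference between an all-reduce and the reduce-scatter used once gradients are partitioned can be shown with a toy simulation. This is a pure-Python sketch, not DeepSpeed or NCCL code: each "device" is just a list of floats, and the function names are hypothetical.

```python
from typing import List


def all_reduce(grads: List[List[float]]) -> List[List[float]]:
    """Every device ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*grads)]
    return [list(total) for _ in grads]


def reduce_scatter(grads: List[List[float]]) -> List[List[float]]:
    """Device i ends up with only shard i of the element-wise sum."""
    n = len(grads)
    length = len(grads[0])
    if length % n != 0:
        raise ValueError("toy example assumes the length divides evenly")
    shard = length // n
    total = [sum(vals) for vals in zip(*grads)]
    return [total[i * shard:(i + 1) * shard] for i in range(n)]


if __name__ == "__main__":
    local_grads = [
        [1.0, 2.0, 3.0, 4.0],      # gradients computed on device 0
        [10.0, 20.0, 30.0, 40.0],  # gradients computed on device 1
    ]
    print("all-reduce:    ", all_reduce(local_grads))
    print("reduce-scatter:", reduce_scatter(local_grads))
```

Concatenating the reduce-scatter shards reproduces the all-reduce result, but each device stores only its own slice, which is the slice its optimizer shard needs.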

Picking a stage, and the offload escape hatch

Higher stages save more memory but add communication steps, and Stage 3 moves roughly half again as much data across the interconnect as plain data parallelism, so the standard advice is to start at the lowest stage that fits the model and escalate only when memory still overflows. On machines with slow inter-GPU links, Stage 3's constant parameter-gathering traffic can dominate step time, making a lower stage plus gradient checkpointing the faster combination. For the most extreme cases, ZeRO-Offload and ZeRO-Infinity extend the idea beyond the GPUs entirely, parking partitioned states in CPU RAM or on NVMe and streaming them in as needed. That is slower per step, but it turns "impossible on this hardware" into "merely patient."
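In DeepSpeed the stage and the offload target are chosen in the configuration. The sketch below builds such a configuration as a Python dictionary, starting at Stage 2 with optimizer states offloaded to CPU RAM. The batch-size and accumulation values are placeholders, and `with_stage` is a hypothetical helper, not part of DeepSpeed.

```python
import copy
import json

# Illustrative starting point: ZeRO Stage 2 with optimizer states in CPU RAM.
# Numeric values are placeholders to be tuned for the actual hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}


def with_stage(config: dict, stage: int) -> dict:
    """Return a copy of the config set to the given ZeRO stage (hypothetical helper)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    new_config = copy.deepcopy(config)
    new_config["zero_optimization"]["stage"] = stage
    return new_config


if __name__ == "__main__":
    # Escalate only if the lower stage still runs out of memory.
    print(json.dumps(with_stage(ds_config, 3), indent=2))
```

DeepSpeed accepts a configuration of this shape as a JSON file or a dictionary; starting low and raising only `stage` when memory overflows keeps each experiment to a single change.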

Why a home lab should care

ZeRO's ideas are not just datacenter machinery. A sovereign builder fine-tuning an open-weight model on a pair of consumer GPUs hits the same wall — optimizer states overflowing VRAM — and the same cures apply, either through DeepSpeed itself or through PyTorch's built-in Fully Sharded Data Parallel (FSDP), which is ZeRO Stage 3's conceptual descendant. Understanding the stages tells you exactly which memory you are trading for which communication, so you can size a training run to the hardware you actually own rather than the cluster you don't.

A worked intuition helps size the win. Train a 7-billion-parameter model conventionally in mixed precision and each GPU carries roughly 14 GB of FP16 weights, another 14 GB of gradients, and somewhere near 84 GB of FP32 master weights and optimizer states: about 112 GB in all, hopeless on consumer cards. Shard all of that state with ZeRO Stage 3 across four GPUs and the per-device burden falls to roughly a quarter, about 28 GB; add offload and the remainder spills into system RAM. The arithmetic is approximate and ignores activations, but the lesson is exact: memory, not compute, is usually what forbids training on owned hardware, and ZeRO is the family of tricks that converts "forbidden" into "slow but yours."
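The same estimate can be run for each stage. This is a rough sketch assuming 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 master weights plus Adam momentum and variance; `per_gpu_state_gb` is an illustrative name, and activations, buffers, and fragmentation are left out.

```python
def per_gpu_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    stage 0 is plain data parallelism; stages 1 to 3 are the ZeRO stages.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params   # FP16 parameters
    grads = 2.0 * num_params     # FP16 gradients
    optim = 12.0 * num_params    # FP32 master weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus        # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus        # Stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus      # Stage 3 also shards parameters
    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = per_gpu_state_gb(7e9, num_gpus=4, stage=stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7-billion-parameter model on four GPUs this gives 112 GB with no sharding, then 49 GB, 38.5 GB, and 28 GB for Stages 1 through 3, which shows how much of the saving arrives with the optimizer states alone.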
