Definition
Data parallelism is the most common way to train a neural network across multiple accelerators. Every device holds a complete copy of the model, and the training batch is split into shards so each device processes a different slice of the data. After the backward pass, the per-device gradients are averaged (an all-reduce) so all replicas stay numerically identical and step in lockstep.
Why it scales well
Because each device runs an independent forward and backward pass and only needs to communicate gradients once per step, data parallelism has modest communication overhead relative to other strategies. It is the default in frameworks like PyTorch's DistributedDataParallel (DDP). Throughput grows close to linearly with device count, which is why it remains the backbone of most training runs.
The memory ceiling
The catch is that every device must hold the entire model, its gradients, and its optimizer states. Once a model no longer fits on a single accelerator, plain data parallelism stops working. That limit is exactly what sharded variants such as Fully Sharded Data Parallel and ZeRO were built to overcome, and what model-splitting strategies like tensor and pipeline parallelism sidestep by dividing the model itself.
For sovereign builders running their own compute, data parallelism is usually the first lever to pull when scaling beyond one GPU. See our entries on Fully Sharded Data Parallel (FSDP) and ZeRO (DeepSpeed) for the memory-efficient successors, and Gradient Accumulation for simulating larger effective batches on limited hardware.
In Simple Terms
Data parallelism is the most common way to train a neural network across multiple accelerators. Every device holds a complete copy of the model, and…
