Definition
Gradient accumulation lets you train with a large effective batch size when your hardware cannot hold one in memory. Instead of running a single forward and backward pass over a huge batch, you process several smaller micro-batches in sequence, add up their gradients, and only perform one optimizer step after the last micro-batch. The model sees the same total amount of data per update as a large batch would, but only one micro-batch ever occupies memory at a time.
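The loop described above can be sketched in plain Python. This is a minimal illustration, not a framework recipe: the one-parameter model `y_hat = w * x`, the helper names `grad_mse` and `train`, the learning rate, and the toy dataset are all assumptions made for the example. Each micro-batch gradient is divided by the number of accumulation steps so that the accumulated value is the mean gradient over the whole effective batch.

```python
# Toy model: y_hat = w * x, trained with mean squared error.
# All names here (grad_mse, train) are illustrative, not from any library.

def grad_mse(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, micro_batch_size, accum_steps, lr=0.001, epochs=50):
    w = 0.0
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    for _ in range(epochs):
        accumulated = 0.0
        for step, batch in enumerate(micro_batches, start=1):
            # Only this micro-batch is processed now; its gradient is scaled so
            # the running total is a mean over the effective batch.
            accumulated += grad_mse(w, batch) / accum_steps
            if step % accum_steps == 0:
                w -= lr * accumulated  # one optimizer step per accum_steps micro-batches
                accumulated = 0.0      # reset, the analogue of zeroing gradients
        # Simplification: leftover micro-batches that do not fill a full
        # accumulation window are discarded at the end of each epoch.
    return w

data = [(float(x), 3.0 * x) for x in range(1, 33)]  # 32 samples of y = 3x
w = train(data, micro_batch_size=8, accum_steps=4)
print(w)  # converges toward 3.0
```

In a real framework the same shape usually appears as several backward passes followed by a single optimizer step and a gradient reset.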
Effective batch size
The effective batch size equals the micro-batch size multiplied by the number of accumulation steps. So a micro-batch of 8 accumulated over 4 steps behaves like a batch of 32 for the purposes of the weight update, while requiring activation memory for only 8 samples at a time. This is invaluable for sovereign builders fine-tuning models on a single consumer GPU, where a batch that large would not otherwise fit in memory.
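The 8 × 4 = 32 arithmetic can be checked numerically. The sketch below assumes a loss that is a mean over samples and micro-batches of equal size; under those assumptions, averaging the four micro-batch gradients gives the same value as one gradient over all 32 samples. The function names and the toy data are hypothetical.

```python
def effective_batch_size(micro_batch_size, accum_steps):
    return micro_batch_size * accum_steps

def mean_gradient(w, batch):
    # d/dw of mean((w * x - y) ** 2) for the toy model y_hat = w * x
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

samples = [(float(x), 2.0 * x + 1.0) for x in range(32)]  # 32 toy samples
w = 0.5
micro, steps = 8, 4

# Gradient from one pass over the full batch of 32.
full = mean_gradient(w, samples)

# Gradient accumulated over 4 micro-batches of 8, each scaled by 1 / steps.
accumulated = sum(mean_gradient(w, samples[i:i + micro]) / steps
                  for i in range(0, len(samples), micro))

print(effective_batch_size(micro, steps))  # 32
print(abs(full - accumulated) < 1e-9)      # True
```

If the micro-batch gradients were summed without the `1 / steps` scaling, the update would be four times larger, which is equivalent to multiplying the learning rate by the number of accumulation steps.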
Costs and caveats
The trade-off is wall-clock time: you run multiple forward and backward passes per update, one after another, so each optimizer step takes longer. There are also subtleties: batch-normalization statistics are computed per micro-batch, not over the full effective batch, and recent research notes that gradient accumulation is not always a free substitute for a genuinely larger batch at very small sizes. Used appropriately, though, it is one of the simplest memory levers available.
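The batch-normalization caveat can be made concrete with a small standard-library sketch. The activation values here are hypothetical and deliberately sorted, which exaggerates the effect; with shuffled data the gap is usually smaller, but the per-micro-batch statistics still differ from those of the full effective batch.

```python
from statistics import fmean, pvariance

# Hypothetical activations for one feature, one value per sample.
activations = [float(v) for v in range(32)]
micro_batch_size = 8

# Statistics a normalization layer would see with a true batch of 32.
full_mean = fmean(activations)
full_var = pvariance(activations)

# Statistics it sees instead when the batch arrives as 4 micro-batches of 8.
chunks = [activations[i:i + micro_batch_size]
          for i in range(0, len(activations), micro_batch_size)]
micro_stats = [(fmean(chunk), pvariance(chunk)) for chunk in chunks]

print(full_mean, full_var)  # 15.5 85.25
print(micro_stats)          # [(3.5, 5.25), (11.5, 5.25), (19.5, 5.25), (27.5, 5.25)]
```

Accumulating gradients does not merge these statistics: each micro-batch is normalized with its own mean and variance, so the forward pass is not identical to that of a single batch of 32.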
Gradient accumulation stacks cleanly with other memory techniques. Combine it with Gradient Checkpointing and sharded training such as Fully Sharded Data Parallel (FSDP) to fit demanding runs on modest hardware.
