Definition
Activation recomputation, also called gradient checkpointing or activation checkpointing, is a memory-saving technique that trades extra computation for reduced memory during training. Backpropagation normally needs the intermediate activations from every layer's forward pass in order to compute gradients, and for deep models or long sequences those stored activations can consume more memory than the model's own weights.
Save now, recompute later
Instead of keeping every activation, the technique stores only a sparse subset, the checkpoints, and discards the rest. During the backward pass, whenever a discarded activation is needed, it is regenerated on the fly by re-running the forward computation from the nearest preceding checkpoint. For a network of L layers, placing a checkpoint roughly every √L layers leaves about √L checkpoints plus at most about √L recomputed activations live at any one time, which cuts activation memory from O(L) to O(√L).
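The save-then-recompute scheme can be sketched on a toy network. The example below is a minimal illustration, not any framework's implementation: the scalar tanh layers, the function names, and the way stored values are counted are all assumptions made for the sketch. It compares a baseline that keeps every layer input with a version that keeps only one input per segment and rebuilds the rest during the backward pass.

```python
import math

def layer(w, x):
    """One toy layer (hypothetical): y = tanh(w * x)."""
    return math.tanh(w * x)

def layer_grad(w, x, grad_y):
    """Backward for the toy layer: returns (dL/dx, dL/dw) given dL/dy."""
    y = math.tanh(w * x)
    local = 1.0 - y * y  # derivative of tanh evaluated at w * x
    return grad_y * local * w, grad_y * local * x

def backward_full(weights, x0):
    """Baseline: keep every layer input, then backpropagate."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0  # the loss is taken to be the final output
    grads_w = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, grads_w[i] = layer_grad(weights[i], inputs[i], grad)
    return grads_w, len(inputs)

def backward_checkpointed(weights, x0, segment):
    """Keep only each segment's input; rebuild the rest during backward."""
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # stored checkpoint
        x = layer(w, x)         # everything else is discarded
    grad = 1.0
    grads_w = [0.0] * n
    peak = 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        # Re-run the forward pass for this segment from its checkpoint.
        inputs = [checkpoints[start]]
        for i in range(start, end - 1):
            inputs.append(layer(weights[i], inputs[-1]))
        # Live values: all checkpoints plus the rebuilt inputs of one segment.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grad, grads_w[i] = layer_grad(weights[i], inputs[i - start], grad)
    return grads_w, peak

weights = [0.5 + 0.05 * i for i in range(16)]
full_grads, full_stored = backward_full(weights, 0.3)
ckpt_grads, ckpt_stored = backward_checkpointed(weights, 0.3, segment=4)
```

With 16 layers and segments of 4 (that is, √16), the checkpointed version holds at most 7 layer inputs at once instead of 16, and both versions produce the same weight gradients.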
The compute trade-off
The price is that the forward computation for non-checkpointed layers runs twice: once during the original forward pass and again when the backward pass regenerates the discarded activations. In practice this typically adds roughly 20% to 30% to training time. In exchange, the freed memory lets you fit a larger model or, often more valuably, a larger batch size on the same GPU, which can improve throughput and training stability.
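A rough back-of-the-envelope check shows where an overhead of this size comes from. The sketch below assumes a common rule of thumb that is not stated above: the backward pass costs about twice the forward pass. The function name and its parameters are hypothetical.

```python
def recompute_overhead(recomputed_fraction, backward_to_forward=2.0):
    """Estimate extra step time as a fraction of the baseline step.

    Assumption (rule of thumb): backward costs `backward_to_forward`
    times the forward pass, and a forward pass costs 1 unit.
    """
    baseline = 1.0 + backward_to_forward  # one forward plus one backward
    extra = recomputed_fraction           # share of the forward pass re-run
    return extra / baseline

full = recompute_overhead(1.0)      # every layer's forward is re-run
partial = recompute_overhead(0.75)  # three quarters of the forward is re-run
```

Under that assumption, re-running the entire forward pass costs about one third extra, and re-running three quarters of it costs 25%, which is in line with the 20% to 30% range quoted above. Real overheads depend on the model, the hardware, and which layers are checkpointed.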
Activation recomputation is a core tool for training large models on limited hardware and stacks cleanly with other memory savers. It is frequently combined with ZeRO-Offload / CPU offload and reduced-precision formats like BF16 to make ambitious training runs feasible on modest, self-owned rigs.
