Definition
A model checkpoint is a saved snapshot of a model's state captured at a point during training. A complete checkpoint typically records the model's learned weights, and often the optimizer state and other training variables such as the current epoch or step count, so the exact training state can be restored later. Checkpoints are the save points of machine learning: they let you pause, recover, and pick the best version of a model without starting over.
What they are for
Their first job is resilience. Training a large model can run for hours or days, and without checkpoints a crash, power loss, or other interruption means losing all that work. If you write a checkpoint at regular intervals, you can resume from the last saved state instead of from scratch, losing only the work done since that save. Their second job is selection: by saving a checkpoint after promising epochs, you can later compare versions and keep the one that generalized best, rather than whatever state training happened to end in.
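The resume-after-interruption idea can be sketched without any framework. The example below is a minimal illustration only, using the Python standard library: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the `crash_at` parameter, and the single-number "weights" are all hypothetical stand-ins, not part of any real training library.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target.

    The rename means a crash mid-write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_steps, save_every, crash_at=None):
    """Toy training loop; crash_at simulates an interruption (hypothetical)."""
    # Resume from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {
        "step": 0,
        "weights": 0.0,              # stand-in for real model weights
        "optimizer": {"lr": 0.1},    # stand-in for real optimizer state
    }
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["weights"] += state["optimizer"]["lr"]  # pretend update
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    return state
```

If a run with `save_every=3` is interrupted at step 7, calling `train` again with the same path picks up from the checkpoint written at step 6, so only one step of work is repeated.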
In practice
Major frameworks make checkpointing routine. Keras and TensorFlow expose a checkpoint callback (ModelCheckpoint) that writes to disk during training, and PyTorch provides a save function (torch.save) for the same purpose. A common pattern is to checkpoint only when validation performance improves, so the saved file always holds the best model seen so far. For self-hosted training, disciplined checkpointing is what makes long runs on your own hardware safe to interrupt and easy to roll back.
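The "save only when validation improves" pattern is simple enough to show in a few lines. This is a framework-free sketch under stated assumptions: the function name `train_with_best_checkpoint`, the precomputed list of validation losses, and the placeholder weights are hypothetical, and a real run would compute the loss each epoch and save actual model state through its framework's own save mechanism.

```python
import json


def train_with_best_checkpoint(val_losses, path):
    """Overwrite the checkpoint at `path` only when validation loss improves.

    val_losses: one validation loss per epoch (assumed precomputed here).
    Returns the list of epochs at which a checkpoint was written.
    """
    best_loss = float("inf")
    saved_epochs = []
    for epoch, val_loss in enumerate(val_losses, start=1):
        weights = {"w": epoch * 0.5}  # placeholder for real model weights
        if val_loss < best_loss:
            best_loss = val_loss
            with open(path, "w") as f:
                json.dump(
                    {"epoch": epoch, "val_loss": val_loss, "weights": weights},
                    f,
                )
            saved_epochs.append(epoch)
    return saved_epochs
```

Because the file is overwritten only on improvement, later epochs that overfit (higher validation loss) never replace the best saved model.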
For related concepts, see our entries on the training epoch, overfitting, and catastrophic forgetting.
