Model Checkpoint

Sovereign AI

A model checkpoint is a saved snapshot of a model's state captured at a point during training. A complete checkpoint records the model's learned weights and typically the optimizer state, learning-rate schedule position, and step counters, so the exact training state can be restored and continued later. Checkpoints are the save points of machine learning: they let you pause, recover, compare, and pick the best version of a model without ever starting over from scratch.

What checkpoints are for

Their first job is resilience. Training a large model can run for hours or days, and without checkpoints a crash, power loss, out-of-memory failure, or preempted spot instance means losing all of that compute. Writing a checkpoint at regular intervals turns a catastrophic loss into a resume from the last save. Their second job is selection: by saving at several points during training, you can later compare versions on validation data and keep the one that generalized best, rather than whatever state training happened to end in. This is a direct defence against overfitting, since the best checkpoint frequently precedes the final one. Their third job is forensics: when a fine-tuning run degrades a capability or exhibits catastrophic forgetting, intermediate checkpoints let you locate where the damage began and branch from just before it.

Checkpoints versus distributed model files

The word has drifted into a second, looser usage worth untangling. The files you download from a model hub (the weights that Ollama or llama.cpp load for inference) are often called "checkpoints" too, because that is what they were: the final saved state of someone's training run. A distributed inference file, however, usually contains weights only, frequently converted and quantized into a format like GGUF, with the optimizer state stripped out. You can run it, but you cannot resume training from it as if the run had never stopped. True training checkpoints are substantially larger than the weights alone: SGD with momentum keeps one extra value per parameter and Adam-style optimizers keep two, so optimizer state can double or triple the footprint. That is why disk budgeting is a real part of training on your own hardware.
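The footprint arithmetic can be sketched in a few lines. This is a rough estimate only, assuming 4-byte (fp32) weights and optimizer values and ignoring metadata; the function name and the 1-billion-parameter model are hypothetical, chosen for illustration.

```python
def checkpoint_size_gb(n_params, weight_bytes=4, optimizer_slots=2, slot_bytes=4):
    """Rough size of a training checkpoint in gigabytes (10**9 bytes).

    optimizer_slots is the number of per-parameter values the optimizer keeps:
    0 for a weights-only file, 1 for SGD with momentum, 2 for Adam-style optimizers.
    """
    weights = n_params * weight_bytes
    optimizer = n_params * optimizer_slots * slot_bytes
    return (weights + optimizer) / 1e9


n = 1_000_000_000  # hypothetical 1B-parameter model

weights_only = checkpoint_size_gb(n, optimizer_slots=0)   # what a distributed file holds
with_momentum = checkpoint_size_gb(n, optimizer_slots=1)  # weights + one optimizer value
with_adam = checkpoint_size_gb(n, optimizer_slots=2)      # weights + two optimizer values

print(weights_only, with_momentum, with_adam)
```

Under these assumptions the weights-only file is about 4 GB, while the same model with Adam-style state is about 12 GB per save. Mixed-precision setups and quantized inference files shift these numbers, so treat the result as a budgeting aid rather than a measurement.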

In practice

Major frameworks make checkpointing routine: Keras and TensorFlow expose checkpoint callbacks that write during training, and PyTorch provides save/load functions (torch.save and torch.load) for state dictionaries. Established patterns do most of the thinking for you: checkpoint every N steps or minutes; additionally checkpoint whenever validation performance improves, so one file always holds the best model so far; and rotate old files so a long run does not silently fill the disk. Two habits separate durable setups from fragile ones: write checkpoints atomically (save to a temporary file in the same directory, then rename it over the final path) so a crash mid-write cannot corrupt your only good save, and actually test a resume before you need it.
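The patterns above (atomic write, best-so-far file, rotation, and a resume test) can be sketched with the standard library alone. This is a minimal illustration, not a framework API: the helper names, the step_*.ckpt naming scheme, and the validation losses are all made up, and pickle stands in for whatever serializer your framework provides (in PyTorch, torch.save would take the place of pickle.dump).

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_atomic(state, path):
    """Write state to path so a crash mid-write never leaves a partial file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def rotate(directory, keep=3, pattern="step_*.ckpt"):
    """Delete all but the `keep` newest rolling checkpoints."""
    files = sorted(Path(directory).glob(pattern))  # zero-padded steps sort correctly
    for old in files[:max(len(files) - keep, 0)]:
        old.unlink()


# Simulated run: five saves, 500 steps apart, with made-up validation losses.
ckpt_dir = Path(tempfile.mkdtemp())
best_loss = float("inf")
for i, val_loss in enumerate([0.9, 0.7, 0.8, 0.6, 0.65], start=1):
    step = i * 500
    state = {"step": step, "weights": [0.1 * i],
             "optimizer": {"lr": 1e-3}, "val_loss": val_loss}
    save_atomic(state, ckpt_dir / f"step_{step:08d}.ckpt")
    if val_loss < best_loss:  # keep one file holding the best model so far
        best_loss = val_loss
        save_atomic(state, ckpt_dir / "best.ckpt")
    rotate(ckpt_dir, keep=3)

# Resume test: load the newest rolling save and check where training would restart.
resumed = load(sorted(ckpt_dir.glob("step_*.ckpt"))[-1])
print(resumed["step"], load(ckpt_dir / "best.ckpt")["step"])
```

The best.ckpt file is deliberately named outside the rotation pattern, so rotation can never delete it. Note that in this run the best save (step 2000) is not the last one (step 2500), which is the selection case described earlier.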

The sovereignty angle

For self-hosted training and fine-tuning, disciplined checkpointing is what makes long runs on your own hardware safe to interrupt and easy to roll back — the difference between a homelab GPU box you can confidently pull the plug on and one holding days of unsaved work hostage. It is the same instinct that drives the rest of the sovereign stack: your weights, your training state, your rollback points, on disks you control. A checkpoint directory is to a training run what a verified backup is to a node — boring, cheap, and the only thing standing between an incident and a disaster.

A reasonable starter policy for a single-GPU homelab: checkpoint every 30 minutes or 500 steps (whichever comes first), keep the last three rolling saves plus the best-validation save, and store a copy of the best checkpoint on a second disk. For a small model or an adapter fine-tune, that handful of rules costs a few gigabytes; for larger models, budget for roughly four full training checkpoints on the training disk. Either way it removes most of the ways a training run can hurt you, which is exactly the ratio of effort to protection you want from any backup discipline.
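The "30 minutes or 500 steps, whichever comes first" trigger might look like the following. This is a sketch under stated assumptions: the CheckpointSchedule class is hypothetical, and the demonstration drives it with a fake clock (1 second per step at first, then 10 seconds per step) so the behaviour can be checked without waiting.

```python
import time


class CheckpointSchedule:
    """Signals a save every max_steps steps or max_seconds seconds, whichever comes first."""

    def __init__(self, max_steps=500, max_seconds=30 * 60, clock=time.monotonic):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.clock = clock
        self.last_step = 0
        self.last_time = clock()

    def due(self, step):
        steps_elapsed = step - self.last_step
        seconds_elapsed = self.clock() - self.last_time
        return steps_elapsed >= self.max_steps or seconds_elapsed >= self.max_seconds

    def mark_saved(self, step):
        # Both counters restart after every save, whichever limit triggered it.
        self.last_step = step
        self.last_time = self.clock()


# Demonstration with a fake clock instead of real time.
fake_now = [0.0]
schedule = CheckpointSchedule(clock=lambda: fake_now[0])

saved_at = []
for step in range(1, 1201):
    # Fast steps (1 s each) up to step 600, slow steps (10 s each) afterwards.
    fake_now[0] += 1.0 if step <= 600 else 10.0
    if schedule.due(step):
        saved_at.append(step)  # a real loop would write the checkpoint here
        schedule.mark_saved(step)

print(saved_at)
```

While steps are fast the step limit fires first (the save at step 500); once steps slow down, the 30-minute limit takes over and saves land every 180 steps. Using time.monotonic as the default clock avoids surprises if the system clock is adjusted during a long run.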
