Master Weights


Master weights are a full-precision FP32 copy of a model's parameters maintained throughout low-precision training. In mixed-precision training, the forward and backward passes run in a 16-bit format for speed and memory savings, but the authoritative weights are kept in 32-bit floating point, and the optimizer applies its updates to that copy. The FP32 master copy is one of the three pillars of stable mixed-precision training, alongside loss scaling and accumulating 16-bit products into FP32 sums.

Why a separate copy is needed

The problem is rounding. Each optimizer step adds an update, the gradient times the learning rate, to every weight, and that update is routinely thousands of times smaller than the weight itself. Floating-point addition preserves the smaller operand only if it is large enough, relative to the larger one, to register within the format's significand. FP16 carries about 3 decimal digits of precision, and its representable values just above 1.0 are spaced roughly 0.001 apart, so adding 0.0001 to 1.0 in FP16 returns exactly 1.0. The update vanishes, and the parameter silently stops learning: there is no error, only training that plateaus below where it should. FP32, with about 7 decimal digits, comfortably preserves the same update. Keeping the running weights in FP32 protects these small but meaningful increments across hundreds of thousands of steps, which is what lets low-precision training match the final accuracy of full FP32 training rather than merely approximating it.
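The vanishing update can be reproduced with nothing but the Python standard library. The sketch below is illustrative only: it emulates FP16 and FP32 storage by round-tripping values through the `struct` module's half- and single-precision formats, and the helper names (`to_fp16`, `to_fp32`, `accumulate`) are hypothetical, not part of any training framework.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def accumulate(weight: float, update: float, steps: int, cast) -> float:
    """Add `update` to `weight` `steps` times, rounding after every addition."""
    w = cast(weight)
    u = cast(update)
    for _ in range(steps):
        w = cast(w + u)
    return w

# The same 1000 small updates, stored in two different formats.
w_fp16 = accumulate(1.0, 0.0001, 1000, to_fp16)
w_fp32 = accumulate(1.0, 0.0001, 1000, to_fp32)

print(w_fp16)  # 1.0: every single update is rounded away
print(w_fp32)  # close to 1.1: the updates accumulate
```

The FP16 weight never moves, no matter how many steps run, while the FP32 weight ends up about 10% larger. That gap is the progress a master copy preserves.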

The training loop in practice

Each iteration follows the same rhythm: cast the FP32 master weights down to a 16-bit working copy; run the forward and backward passes in 16-bit, where the heavy matrix math lives and where reduced precision buys real speed on tensor cores; then apply the resulting gradients to the FP32 master copy, unscaling them first if loss scaling is in use. Only the FP32 version persists between steps. The 16-bit copy is disposable scaffolding, rebuilt fresh every iteration. This roughly halves the memory traffic of the expensive passes and lets them run on faster 16-bit hardware paths, while spending full precision only where numerical fidelity actually matters: the accumulation of tiny updates over time.
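The rhythm described above can be sketched on a toy problem: fitting a single weight `w` so that `w * x` matches a target. This is a minimal illustration under stated assumptions, not how any framework implements it. The function name `train_toy` is hypothetical, the 16-bit arithmetic is emulated by rounding through `struct`, and loss scaling is omitted for brevity.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value (emulates FP16 storage)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value (emulates FP32 storage)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def train_toy(steps: int = 1000, lr: float = 0.01,
              x: float = 1.0, target: float = 3.0) -> float:
    """Minimise (w * x - target)^2 with an FP32 master weight (toy example)."""
    master_w = to_fp32(0.0)  # the only copy that persists between steps
    for _ in range(steps):
        # 1. Cast the FP32 master down to a disposable 16-bit working copy.
        w16 = to_fp16(master_w)
        # 2. Forward pass in 16-bit.
        pred = to_fp16(w16 * to_fp16(x))
        err = to_fp16(pred - to_fp16(target))
        # 3. Backward pass in 16-bit: d/dw (w*x - target)^2 = 2 * err * x.
        grad16 = to_fp16(2.0 * err * to_fp16(x))
        # 4. Apply the update to the FP32 master copy, not to w16.
        master_w = to_fp32(master_w - to_fp32(lr) * grad16)
    return master_w

print(train_toy())  # approaches the ideal weight of 3.0
```

Note that `w16` is never updated directly; it is rebuilt from `master_w` at the top of every iteration, which is the "disposable scaffolding" role the working copy plays.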

The memory bill

The fidelity is not free. Mixed-precision training stores the 16-bit working weights and their 16-bit gradients plus the FP32 masters, and popular optimizers like Adam add two more FP32 tensors (momentum and variance) per parameter. The commonly cited total is roughly 16 bytes per parameter before activations (2 + 2 + 4 + 4 + 4), so a 7-billion-parameter model wants on the order of 112 GB just for weights, gradients, and optimizer state. This is the memory budget that dominates training-rig planning, and it is why the master copy is a favorite target of memory engineering: sharding schemes like ZeRO and FSDP split masters and optimizer state across devices, and 8-bit optimizer variants compress the state tensors while keeping the master weights' precision where it counts.
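The 16-byte figure is easy to check by hand. The sketch below assumes the common breakdown for Adam-based mixed-precision training, in which 16-bit gradients account for 2 of the 16 bytes alongside the working weights, the master copy, and the two Adam state tensors. The names `BYTES_PER_PARAM` and `training_state_gb` are hypothetical, and a GB is taken as 10^9 bytes.

```python
# Per-parameter storage in bytes, assuming Adam with 16-bit working tensors.
BYTES_PER_PARAM = {
    "16-bit working weights": 2,
    "16-bit gradients": 2,
    "FP32 master weights": 4,
    "FP32 Adam momentum": 4,
    "FP32 Adam variance": 4,
}

def training_state_gb(num_params: int) -> float:
    """Weights, gradients, and optimizer state in GB, excluding activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

print(sum(BYTES_PER_PARAM.values()))     # 16
print(training_state_gb(7_000_000_000))  # 112.0
```

The master weights and the two Adam tensors together make up 12 of the 16 bytes, which is why sharding and state compression aim at exactly those entries.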

Why it matters to a self-hosted builder

Master weights are a large part of why a model can train in BF16 or even FP8 without degrading: the low-precision formats do the racing while FP32 keeps the books. They are also a key line in the VRAM arithmetic that decides what fits on your hardware, and they explain an asymmetry every home-lab operator eventually notices: inference needs only one low-precision copy of the weights, while training needs several copies at mixed precisions. The gap between "my GPU can run this model" and "my GPU can train this model" is, in large part, the master weights and their optimizer entourage.

The concept also sharpens intuition about what a released checkpoint actually is. The BF16 or FP16 weights you download are typically a cast-down snapshot of the training run's FP32 masters: perfectly good for inference, but missing the precision headroom that training required. This is why resuming serious training from a low-precision checkpoint, without restoring proper master copies and optimizer state, can degrade results, and why full training checkpoints are several times larger than their inference-ready exports. Knowing which artifact you are holding, the working copy or the master ledger, is the difference between a clean resume and a mysteriously ailing run.
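Both points can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions: the master value `1.0003` is a made-up example of a weight that has absorbed a few small updates, FP16 storage is emulated through `struct`, and the checkpoint sizes assume a training checkpoint holding FP32 masters plus two FP32 Adam tensors against a 16-bit export.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value (emulates an FP16 export)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical master weight that has accumulated three updates of 0.0001.
master = 1.0003
exported = to_fp16(master)
print(exported)  # 1.0: the accumulated progress is not in the export

# Bytes per parameter in each artifact (assumed layout, Adam optimizer).
TRAINING_CHECKPOINT_BYTES = 4 + 4 + 4  # FP32 master + Adam momentum + Adam variance
INFERENCE_EXPORT_BYTES = 2             # one 16-bit copy of the weights
print(TRAINING_CHECKPOINT_BYTES / INFERENCE_EXPORT_BYTES)  # 6.0
```

A run resumed from `exported` starts from 1.0 with empty optimizer state, so the fine-grained progress the master copy held has to be earned again.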
