FP8

Sovereign AI

FP8 is an 8-bit floating-point format used to push neural-network training and inference below the 16-bit precision of BF16. A floating-point number splits its bits among a sign, an exponent (setting dynamic range), and a mantissa (setting precision); with only 8 bits total there is no allocation that serves every purpose, so FP8 is standardised as two complementary variants — a deliberate his-and-hers design in which each variant covers the other's weakness across different parts of the training loop.

E4M3 and E5M2

The E4M3 variant uses one sign bit, four exponent bits, and three mantissa bits, representing magnitudes up to about ±448 with comparatively fine granularity. The E5M2 variant uses one sign, five exponent, and two mantissa bits, reaching about ±57,344 and following IEEE conventions with infinities and NaN. The split maps onto the arithmetic of training. Forward-pass activations and weights are reasonably well-behaved in magnitude but sensitive to precision, so E4M3 carries the forward pass. Backward-pass gradients span wildly varying magnitudes but tolerate coarse precision, so E5M2 (or BF16) carries them. Master copies of the weights and the optimizer states stay in higher precision — BF16 or FP32 master weights — so that many small updates are not lost to rounding.

The practical payoff

Halving data size again versus 16-bit formats roughly doubles the throughput of the matrix engines on accelerators whose tensor cores support FP8 arithmetic, while halving the memory traffic and footprint of the largest layers — and memory bandwidth, not raw compute, is the binding constraint in much of modern deep learning. Because an 8-bit float can represent so few distinct values, everything hinges on scaling: frameworks maintain per-tensor scaling factors, tracked automatically from recent value statistics, that shift each tensor into FP8's narrow usable window before conversion and back out afterward. That machinery is the modern descendant of loss scaling, generalised from one global factor to a factor per tensor.

FP8, quantization, and the home lab

It helps to place FP8 on the reduced-precision spectrum. It is a floating-point 8-bit format used natively during training and high-throughput serving on supporting hardware, which distinguishes it from the integer formats of post-training quantization (INT8, INT4, and the GGUF family) that compress an already-trained model for deployment. The two attack the same enemy — memory — from different ends of the lifecycle. For a self-hosted builder, FP8's relevance depends on hardware: recent GPU generations expose FP8 tensor operations and FP8-quantized checkpoints of open-weight models are increasingly common, cutting VRAM needs roughly in half versus 16-bit with modest quality cost, while older cards see no speedup and are better served by integer quantization. FP8 is not a free lunch — it demands careful scaling, and not every layer or model survives such aggressive precision reduction — but it marks the current leading edge of a long trend: every halving of precision that quality can survive is a doubling of what a given machine, including one you own, can run for inference.

Practical guidance for self-hosters

If you are choosing formats for a home inference box, the decision tree is short. First, check what your GPU accelerates natively: FP8 checkpoints only pay off on hardware whose tensor cores execute FP8 matrix math, while integer quantizations run acceptably almost anywhere. Second, match format to bottleneck — if the model barely fits in VRAM, aggressive integer quantization frees the most memory; if it fits comfortably and you want throughput for serving multiple users, a native FP8 build on supporting hardware is often the better trade. Third, always validate on your own prompts: precision loss is task-dependent, and a format that is transparent for summarization can measurably degrade long-chain reasoning or code generation. The encouraging trend line is that each hardware generation makes lower precision more usable — meaning the models a sovereign operator can serve well keep growing faster than the budget required to serve them.

FP8 is an 8-bit floating-point format used to push neural-network training and inference below the 16-bit precision of BF16. A floating-point number splits its bits…

Explore the Full Glossary

Browse all Bitcoin mining terms from A to Z. Whether you are a beginner or expert, deepen your understanding of the mining ecosystem.

Mining Glossary

ASIC Miner Database

Compare 500+ miners with real-time profitability data, home mining scores, and detailed specs.

Compare Miners

FP8

Definition