Definition
FP8 is an 8-bit floating-point format used to push neural-network training and inference below the 16-bit precision of BF16. Because a usable 8-bit float can represent so few distinct values, FP8 is standardised as two complementary variants chosen for different parts of the training loop.
E4M3 and E5M2
The E4M3 variant uses one sign bit, four exponent bits, and three mantissa bits, representing magnitudes up to about ±448. The E5M2 variant uses one sign, five exponent, and two mantissa bits, reaching about ±57,344 plus infinities and NaN. The split is deliberate. Forward activations and weights need more precision, so E4M3 carries the forward pass. Backward gradients need more dynamic range but tolerate coarser precision, so E5M2 (or BF16) carries them. Master weights and optimizer states stay in BF16 or FP32.
The practical payoff
Halving data size again versus 16-bit formats roughly doubles the throughput of the matrix engines on accelerators that support FP8 tensor operations, while shrinking the memory footprint of the largest layers. Software frameworks manage per-tensor scaling factors automatically so that values land in FP8's usable window, which is the modern descendant of loss scaling.
FP8 is not a free lunch: it demands careful scaling, and not every layer is safe to quantise this aggressively. It is best understood as the leading edge of the reduced-precision spectrum that also includes BF16 and the FP32 master weights kept for stable optimizer updates.
In Simple Terms
FP8 is an 8-bit floating-point format used to push neural-network training and inference below the 16-bit precision of BF16. Because a usable 8-bit float can…
