Definition
FP16 and INT8 are two of the lower-precision number formats used to store and compute the weights and activations inside a neural network. Standard training uses 32-bit floating point (FP32); shrinking to FP16 (16-bit half precision) or INT8 (8-bit integer) cuts memory footprint and accelerates the matrix math that dominates inference. For anyone running a model on their own hardware, choosing the right precision is the difference between a model that fits in memory and one that does not.
FP16 vs INT8
FP16 keeps a floating-point representation with a sign, exponent, and mantissa, so it covers a wide dynamic range and roughly halves memory versus FP32. It is usually a drop-in swap with little accuracy loss. INT8 maps values onto just 256 discrete integer levels, making the model about four times smaller than FP32 and dramatically faster on hardware with integer math pipelines, but it requires careful calibration to avoid degrading accuracy.
The Accuracy Trade-Off
Going lower-bit is never free. INT8 discards precision, so values must be scaled and calibrated, and stubborn models may need quantization-aware training to recover lost accuracy. The payoff is real: smaller models, lower power draw, and higher throughput, which is exactly what matters on a constrained edge device or a home server.
Precision is the lever behind most model-size optimization. See our companion entries on TOPS for how precision changes a chip's rated throughput, and Local LLM for running quantized models on your own gear.
In Simple Terms
FP16 and INT8 are two of the lower-precision number formats used to store and compute the weights and activations inside a neural network. Standard training…
