Definition
Quantization-aware training (QAT) is a technique for producing models that remain accurate after being compressed to low numerical precision. Rather than quantizing a finished model in one pass, QAT folds the effects of quantization into the training process itself: during the forward pass the model simulates the rounding and clipping of low-precision arithmetic, so it learns weights that are robust to that loss of precision. The result is a model that degrades far less when finally deployed in 8-bit, 4-bit, or even lower formats.
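The simulated rounding and clipping described above is often called "fake quantization". The following is a minimal sketch of that one step in plain Python, not a full training loop. The function name `fake_quantize`, the fixed range `[-1.0, 1.0]`, and the sample weights are assumptions chosen for illustration.

```python
def fake_quantize(x, num_bits=8, x_min=-1.0, x_max=1.0):
    """Simulate low-precision storage of one float: clip, round to a grid, map back.

    Hypothetical helper for illustration; the range [x_min, x_max] is assumed fixed.
    """
    levels = 2 ** num_bits - 1                 # number of steps between x_min and x_max
    scale = (x_max - x_min) / levels           # width of one quantization step
    clipped = min(max(x, x_min), x_max)        # clipping: out-of-range values saturate
    index = round((clipped - x_min) / scale)   # rounding: snap to the nearest grid point
    return x_min + index * scale               # back to float for the rest of the forward pass


# Illustrative weights (made up); note how the grid gets coarser as bits drop.
weights = [-1.7, -0.42, 0.0, 0.3, 0.99]
for bits in (8, 4, 2):
    print(bits, [round(fake_quantize(w, bits), 4) for w in weights])
```

The sketch covers only the forward pass. Rounding has zero gradient almost everywhere, so real QAT setups usually pair it with a gradient approximation such as the straight-through estimator, which is not shown here.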
QAT vs. post-training quantization
The simpler alternative, post-training quantization (PTQ), converts a pre-trained model's weights to lower precision after training is finished. PTQ is fast and needs no retraining, and at 8 bits it usually works well. But as you push toward 4 bits and below, PTQ's accuracy drop can become severe, and that is exactly where QAT earns its cost, recovering accuracy that PTQ loses. The trade-off is that QAT requires additional training (often fine-tuning from a pre-trained checkpoint rather than training from scratch) with access to training data, which is far more expensive than a one-pass conversion.
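To make the bit-width effect concrete, here is a small sketch that rounds a set of weights after the fact, as PTQ does, and measures how far the results drift from the originals. It assumes a simple symmetric, per-tensor scheme. The helper names and the randomly generated weights are hypothetical. It measures weight rounding error only, not model accuracy.

```python
import random


def ptq_round_trip(weights, num_bits):
    """Quantize weights to signed integers with one shared scale, then dequantize.

    Assumes a symmetric per-tensor scheme and num_bits >= 2 (illustrative only).
    """
    qmax = 2 ** (num_bits - 1) - 1                 # largest integer code, e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax    # one scale for the whole tensor
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # round to an integer code and clip
        restored.append(q * scale)                   # map the code back to a float
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


rng = random.Random(0)                               # fixed seed so runs are repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for real model weights

errors = {bits: mean_abs_error(weights, ptq_round_trip(weights, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean absolute rounding error: {err:.4f}")
```

Each bit removed roughly doubles the grid spacing, so the rounding error grows quickly at low bit widths. QAT does not shrink this error; it trains the model to tolerate it.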
Why it matters for local models
Aggressive quantization is what lets large models fit on consumer hardware. When you download a heavily quantized open-weight model that still performs well at 4-bit, QAT (or related advanced methods) is often part of why it holds up. Knowing whether a low-bit build was produced with QAT or with plain PTQ helps you judge whether it is likely to be usable.
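The memory saving follows from simple arithmetic: weight storage is roughly the parameter count times the bits per weight. The sketch below works that out for a hypothetical 7-billion-parameter model. The parameter count and the function name are assumptions, and the estimate ignores quantization metadata (scales, zero points), activations, and any runtime caches.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    total_bytes = num_params * bits_per_weight / 8   # 8 bits per byte
    return total_bytes / 2 ** 30


HYPOTHETICAL_PARAMS = 7_000_000_000   # assumed model size, for illustration only

for bits in (16, 8, 4, 2):
    size = weight_memory_gib(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {size:.2f} GiB")
```

Halving the bit width halves the weight footprint, which is why the step from 16-bit to 4-bit can decide whether a model fits in consumer GPU memory at all.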
For the broader concept of reducing precision, see quantization; for how distillation similarly shrinks models, see knowledge distillation.
