LLM quantization converts model weights from high-precision floats (FP16 = 2 bytes/weight) to lower-precision integers — INT8 (1 byte) or INT4 (0.5 bytes) — cutting VRAM requirements by 50–75% with minimal quality loss at INT8 and moderate, often acceptable loss at INT4. For a single consumer GPU running a 7 B-parameter model, INT4 typically brings VRAM needs from ~14 GB down to roughly 4–6 GB (including overhead), enabling hardware that would otherwise be unusable. Quantization-Aware Training (QAT) — deployed by Google DeepMind in the Gemma 3 QAT series — closes most of the quality gap by simulating low-precision arithmetic during training itself, not just at inference time.
Running a large language model locally means fitting billions of floating-point weights into your GPU’s VRAM. Quantization is the primary technique that makes this possible on affordable hardware. This guide explains each format, the practical tradeoffs, and how to pick the right tier for your setup.
If you just want to know what hardware you need, jump to the VRAM calculator or the local AI hardware guide. If you are evaluating running local AI in Canada, see the local LLM Canada overview.
The precision ladder: FP32 → FP16 → INT8 → INT4
Every neural-network weight is a number. The question is how many bits you use to store it.
FP32 — 32-bit float (training standard)
FP32 stores each weight in 32 bits (4 bytes). It is the standard precision for training and gives the model the most numerical headroom to learn. At inference time FP32 is rarely used on consumer hardware because the VRAM cost is double FP16 with no practical quality benefit — a 7 B-parameter model alone would require roughly 28 GB of VRAM for weights before factoring in context or KV cache.
FP16 / BF16 — 16-bit float (inference baseline)
FP16 (IEEE 754 half-precision) and BF16 (Brain Float 16, used by Google TPUs and modern Nvidia Ampere/Ada GPUs) each store weights in 16 bits (2 bytes). Quality at inference is essentially indistinguishable from FP32 for the vast majority of tasks. FP16 is the baseline precision most quality benchmarks cite — when a model card says a score was achieved “at full precision,” it typically means FP16.
- 7 B model — weight-only VRAM: ~14 GB (FP16)
- 13 B model: ~26 GB
- 70 B model: ~140 GB
Note: These are weight-only estimates. Total VRAM during inference is higher due to KV cache (scales with context length), activations, and framework overhead. Budget 20–30 % on top as a rough margin for short contexts.
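The weight-only figures above follow directly from parameters × bits per weight, and the KV-cache term can be estimated the same way. The sketch below is a rough estimator, not a measurement: the function names are made up for illustration, and the 32-layer / 32-head / 128-dimension configuration is an assumed 7 B-class architecture, not a figure from any specific model card.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache memory in GB.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head; bytes_per_elem=2 assumes an FP16 cache.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1e9


if __name__ == "__main__":
    # Assumed 7 B-class configuration, for illustration only.
    weights = weight_gb(7, 16)
    cache = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)
    print(f"FP16 weights: {weights:.1f} GB, KV cache at 4096 tokens: {cache:.2f} GB")
```

Doubling the context length doubles the KV-cache term while the weight term stays fixed, which is why long-context workloads need more margin than the 20–30 % rule of thumb.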
INT8 — 8-bit integer (the reliable workhorse)
INT8 quantization converts each weight from a 16-bit float to an 8-bit integer, cutting per-weight storage in half. A small scaling factor (stored per tensor, per channel, or per group of weights, depending on the method) is kept alongside to recover the approximate original magnitude at inference time.
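As a minimal sketch of the idea, the NumPy snippet below applies symmetric "absmax" quantization with a single scale for the whole tensor. This is the simplest possible variant, chosen for clarity; real libraries typically use finer-grained scales, and the function names here are illustrative only.

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    """Map float weights to int8 using one absmax scale for the tensor."""
    scale = float(np.max(np.abs(w))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))
print(f"bytes: {w.astype(np.float16).nbytes} (FP16) -> {q.nbytes} (INT8)")
print(f"scale: {scale:.6f}, max round-trip error: {max_err:.6f}")
```

The round-trip error is bounded by half the scale, so a single large outlier weight inflates the error for every other weight in the tensor. That is the motivation for per-channel and per-group scales.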
Quality impact at INT8 is generally low. Most benchmarks show a degradation of under 1–2 percentage points on standard tasks (MMLU, HellaSwag, etc.) compared to FP16, though the exact delta depends on the model architecture and the quantization method. INT8 is a safe choice when VRAM is tight but you cannot tolerate a perceptible quality regression.
- 7 B model — weight-only VRAM: ~7 GB (INT8)
- 13 B model: ~13 GB
- 70 B model: ~70 GB
INT4 — 4-bit integer (the consumer sweet spot)
INT4 stores each weight in just 4 bits (half a byte), delivering a ~75 % reduction in weight-only VRAM vs FP16. This is what makes it possible to run a capable 7 B model on an 8 GB GPU or a 13 B model on a 12–16 GB card.
Quality impact is more noticeable than INT8. Typical benchmark deltas versus FP16 are in the 2–5 percentage-point range on reasoning-heavy tasks, though this varies considerably by model and quantization method. For chat, summarization, and coding assistance at moderate difficulty, most users find INT4 perfectly adequate. QAT (covered below) can close this gap substantially.
- 7 B model — weight-only VRAM: ~3.5–5 GB (INT4, depending on groupsize and format)
- 13 B model: ~7–8 GB
- 70 B model: ~35–42 GB
All VRAM figures are approximate and depend on quantization method, groupsize, and whether activations are also quantized. Always verify with the model-specific card on Hugging Face before purchasing hardware.
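To see why groupsize moves the VRAM figure, the sketch below quantizes weights to 4 bits in groups, stores an FP16 scale and minimum per group, and packs two 4-bit values into each byte. It is an illustrative asymmetric scheme written for this guide, not the exact layout of GPTQ, AWQ, or any GGUF type; it assumes the weight count is divisible by an even group size.

```python
import numpy as np


def quantize_int4_grouped(w: np.ndarray, group_size: int = 128):
    """Asymmetric 4-bit quantization with one FP16 scale and minimum per group."""
    groups = w.reshape(-1, group_size).astype(np.float32)
    w_min = groups.min(axis=1, keepdims=True).astype(np.float16)
    w_max = groups.max(axis=1, keepdims=True)
    scale = ((w_max - w_min.astype(np.float32)) / 15.0).astype(np.float16)
    scale[scale == 0] = 1.0  # constant group: avoid division by zero
    # Quantize against the FP16 values that will actually be stored.
    q = np.round((groups - w_min.astype(np.float32)) / scale.astype(np.float32))
    q = np.clip(q, 0, 15).astype(np.uint8)
    packed = (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)  # two values per byte
    return packed, scale, w_min


def dequantize_int4_grouped(packed, scale, w_min) -> np.ndarray:
    """Unpack the 4-bit values and map them back to floats."""
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    q[:, 0::2] = packed & 0x0F
    q[:, 1::2] = packed >> 4
    return q.astype(np.float32) * scale.astype(np.float32) + w_min.astype(np.float32)


def effective_bits(group_size: int, scale_bits: int = 16, min_bits: int = 16) -> float:
    """Bits per weight once per-group metadata is included."""
    return 4 + (scale_bits + min_bits) / group_size


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
packed, scale, w_min = quantize_int4_grouped(w)
w_hat = dequantize_int4_grouped(packed, scale, w_min).reshape(w.shape)

print(f"packed bytes: {packed.nbytes} for {w.size} weights")
print(f"effective bits/weight: {effective_bits(128)} (g=128), {effective_bits(32)} (g=32)")
```

Smaller groups track the local weight range more closely, which lowers error, but the metadata cost rises: under these assumptions a group size of 32 costs a full extra bit per weight compared with plain 4-bit storage.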
Quality-vs-VRAM quick-reference table
The table below summarises the formats you will encounter. Quality retention figures are indicative ranges drawn from community benchmarks (MMLU, perplexity scores); treat them as directional, not absolute, because results vary by model family, size, and task type. Always test on your own workload before committing to a format.
| Format | Bits / weight | VRAM vs FP16 | Typical quality retention | Best suited for |
|---|---|---|---|---|
| FP32 | 32 | 2× | Reference (training) | Fine-tuning; rarely used at inference |
| FP16 / BF16 | 16 | 1× (baseline) | ≈ 100 % of FP32 | Inference baseline; GPU with 24 GB+ VRAM |
| INT8 | 8 | ~0.5× | ~98–99 % (indicative) | Production; minimal quality compromise acceptable |
| INT4 (PTQ) | 4 | ~0.25–0.3× | ~95–98 % (indicative) | Consumer GPUs 8–16 GB; general chat/coding |
| INT4 (QAT) | 4 | ~0.25–0.3× | Closer to INT8 levels (model-dependent) | Best of both worlds; Gemma 3 QAT is the reference example |
| Q4_K_M (GGUF) | ~4.5 effective | ~0.3× | Slightly above plain INT4; K-quant grouping helps | CPU + GPU hybrid via llama.cpp; the community default |
| Q5_K_M (GGUF) | ~5.5 effective | ~0.35× | Near INT8; perplexity loss very small | Higher-quality CPU inference; slightly more VRAM |
| Q2_K (GGUF) | ~2.6 effective | ~0.17× | Notable degradation; use only under extreme constraints | Absolute minimum VRAM / RAM budgets only |
Sources: llama.cpp project documentation (Georgi Gerganov et al.), Hugging Face community benchmarks, Google DeepMind Gemma 3 QAT model cards. Quality figures are ranges based on community perplexity and benchmark reports — not a guarantee for any specific model or task.
How quantization is done: PTQ methods
Post-Training Quantization (PTQ) is applied after the model has been fully trained. The weights already exist at FP16/FP32; a quantization algorithm converts them. No re-training required. This is the dominant approach for publicly released quantized models.
GPTQ
GPTQ (Frantar et al., 2022) uses a small calibration dataset to find the INT4 or INT8 representation that minimises reconstruction error layer by layer. It is the basis of most 4-bit models you will find on Hugging Face under names like model-GPTQ. GPTQ models require a compatible loader (AutoGPTQ, ExLlamaV2) and run primarily on CUDA GPUs.
AWQ (Activation-aware Weight Quantization)
AWQ (Lin et al., 2023) observes that not all weights are equally important — a small fraction have a disproportionate impact on output quality. It protects those salient weights during quantization. AWQ typically achieves lower perplexity than GPTQ at INT4 for the same model, at the cost of a slightly longer quantization process. Look for model-AWQ tags on Hugging Face.
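The core trick can be illustrated with a toy example: scale up the weight columns that meet large activations, scale the activations down by the same factor, and quantize the scaled weights. The sketch below is a simplification under stated assumptions. The square-root scaling rule is a fixed, hypothetical choice (the published method searches for the scaling), and the quantizer is plain round-to-nearest, so it shows the mechanism and does not reproduce AWQ.

```python
import numpy as np


def quant4_rows(w: np.ndarray) -> np.ndarray:
    """Symmetric 4-bit round-to-nearest with one scale per output row (fake-quantized floats)."""
    scale = np.max(np.abs(w), axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    return np.clip(np.round(w / scale), -8, 7) * scale


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))            # activations: (tokens, in_features)
x[:, 3] *= 10.0                          # input channel 3 carries large activations
w = rng.normal(0.0, 0.05, size=(16, 32))  # weights: (out_features, in_features)

# Per-input-channel scale derived from activation magnitude (hypothetical rule).
act_mag = np.mean(np.abs(x), axis=0)
s = np.sqrt(act_mag / act_mag.mean())

y_ref = x @ w.T
y_scaled = (x / s) @ (w * s).T            # mathematically identical before quantization

y_plain = x @ quant4_rows(w).T            # quantize the weights as they are
y_aware = (x / s) @ quant4_rows(w * s).T  # quantize the activation-scaled weights

err_plain = float(np.mean((y_plain - y_ref) ** 2))
err_aware = float(np.mean((y_aware - y_ref) ** 2))
print(f"output MSE, plain: {err_plain:.6f}, activation-aware: {err_aware:.6f}")
```

Because the scaling cancels exactly in full precision, any difference between the two printed errors comes purely from how rounding error is distributed across channels.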
bitsandbytes (LLM.int8())
The bitsandbytes library by Tim Dettmers implements INT8 quantization with mixed-precision decomposition — outlier activations are kept in FP16, while the bulk of weights are INT8. It is widely used via the Hugging Face load_in_8bit flag and requires a CUDA GPU. A 4-bit NormalFloat (NF4) option is also available via QLoRA workflows.
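The mixed-precision decomposition can be sketched in a few lines of NumPy. The snippet below is a toy reimplementation of the idea and does not use the bitsandbytes API: hidden dimensions whose activations exceed a threshold take a floating-point path, and everything else goes through an INT8 matrix multiply with per-row and per-column scales. The threshold of 6.0 and all names are assumptions for illustration.

```python
import numpy as np


def quant_int8_rows(m: np.ndarray):
    """Absmax int8 quantization with one scale per row."""
    scale = np.max(np.abs(m), axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    return np.round(m / scale).astype(np.int8), scale


def mixed_precision_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0):
    """Compute x @ w with outlier hidden dimensions kept in floating point.

    Toy sketch: assumes at least one hidden dimension is not an outlier.
    """
    outlier = np.any(np.abs(x) >= threshold, axis=0)  # one flag per hidden dim

    # High-precision path for the few outlier dimensions.
    hi = x[:, outlier] @ w[outlier, :]

    # INT8 path for the rest: scale per row of x, per column of w.
    xq, xs = quant_int8_rows(x[:, ~outlier])
    wq, ws = quant_int8_rows(w[~outlier, :].T)
    acc = xq.astype(np.int32) @ wq.T.astype(np.int32)  # integer accumulation
    lo = acc.astype(np.float64) * xs * ws.T            # rescale back to floats

    return hi + lo, outlier


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))            # (tokens, hidden)
x[:, 5] = 25.0                          # inject one outlier hidden dimension
w = rng.normal(0.0, 0.05, size=(64, 16))  # (hidden, out)

y, outlier = mixed_precision_matmul(x, w)
max_err = float(np.max(np.abs(y - x @ w)))
print(f"outlier dims: {int(outlier.sum())}, max abs error vs float matmul: {max_err:.5f}")
```

Without the split, the value of 25.0 in one column would set the scale for every row of `x`, leaving the ordinary activations with very few usable INT8 levels.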
GGUF and llama.cpp
llama.cpp (created by Georgi Gerganov; open source, MIT-licensed) is the canonical CPU-focused inference engine. It uses the GGUF file format, which bundles weights and model metadata into a single file with a structured, versioned binary layout. GGUF replaced the earlier GGML format in mid-2023 and added cleaner metadata, tokeniser storage, and better version management.
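The versioned layout is easy to inspect by hand. The sketch below reads the fixed-size header, assuming the GGUF version 2/3 layout as commonly documented: a 4-byte `GGUF` magic, a little-endian 32-bit version, then 64-bit tensor and metadata key-value counts. Check the current specification before relying on it; the function name and the synthetic header values are made up for the demo.

```python
import io
import struct


def read_gguf_header(stream) -> dict:
    """Read the fixed-size GGUF header (assumed v2/v3 layout, little-endian)."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    (version,) = struct.unpack("<I", stream.read(4))
    tensor_count, kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; with a real model you would pass
# an open binary file handle instead, e.g. open(path, "rb").
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)
```

The metadata key-value pairs that follow the header are where the tokeniser, architecture parameters, and quantization type are stored, which is what lets a single file be loaded without side-car configuration.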
GGUF supports a spectrum of quantization levels. The naming convention uses the letter Q (quantization), the bit-depth, and an optional K-quant suffix:
- Q4_K_M — 4-bit K-quant, medium. The community default. Groups of weights share a scale and minimum-value pair (the “K” grouping), which reduces error vs plain Q4_0. Typically the best VRAM-to-quality ratio.
- Q5_K_M — 5-bit K-quant, medium. Noticeably better quality than Q4_K_M with ~20 % more storage.
- Q6_K — 6-bit K-quant. Very close to FP16 quality; useful when VRAM allows.
- Q8_0 — 8-bit flat quantization. Near-lossless; file is ~half of FP16.
- Q2_K — 2-bit K-quant. Very high compression but significant quality loss. Use only when RAM is severely constrained.
A key llama.cpp feature: it can split inference across both system RAM and GPU VRAM, “offloading” a chosen number of transformer layers to the GPU while running the rest on CPU. This allows running models that exceed GPU VRAM at reduced speed, which is useful for experimentation.
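A rough way to choose the layer count is to divide the model file size evenly across layers and see how many fit in the VRAM left after a reserve for KV cache and buffers. The helper below is a back-of-envelope sketch: the function name, the 1.5 GB reserve, and the even-split assumption are all illustrative, and real layers (plus embeddings and the output head) are not exactly equal in size.

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes the file size is spread evenly across layers and keeps
    reserve_gb free for KV cache, activations, and framework buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # 13 B parameters at ~4.5 bits/weight is about 7.3 GB; 40 layers is an assumed count.
    model_gb = 13 * 4.5 / 8
    for vram in (6.0, 8.0, 12.0):
        n = layers_on_gpu(model_gb, n_layers=40, vram_gb=vram)
        print(f"{vram:>4.0f} GB VRAM -> offload about {n} of 40 layers")
```

The result is a starting value for llama.cpp's `--n-gpu-layers` (`-ngl`) option; if the model fails to load or runs out of memory at long contexts, lower the number and retry.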
Quantization-Aware Training (QAT): closing the quality gap
All the methods above are PTQ: quantize after training. QAT takes a fundamentally different approach — it simulates low-precision arithmetic during training (or a fine-tuning pass), so the model learns to be robust to quantization error from the start. The result is a model whose weights, when quantized to INT4, behave more like INT8 in practice.
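A common way to simulate low precision during training is "fake quantization" with a straight-through estimator (STE): the forward pass uses rounded weights, while the backward pass treats the rounding step as the identity. The toy below applies that to a linear regression in NumPy. It is a generic sketch of the technique, not the recipe used for any particular released model, and the learning rate, step count, and problem size are arbitrary.

```python
import numpy as np


def fake_quant(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round weights to a signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


rng = np.random.default_rng(0)
X = rng.normal(size=(512, 16))
w_true = rng.normal(size=16)
y = X @ w_true

w = np.zeros(16)  # full-precision "latent" weights that the optimizer updates
lr = 0.05
for _ in range(300):
    w_q = fake_quant(w)                         # forward pass sees 4-bit weights
    grad_wq = 2 * X.T @ (X @ w_q - y) / len(y)  # gradient w.r.t. quantized weights
    w -= lr * grad_wq                           # STE: treat d(w_q)/d(w) as 1

w_q = fake_quant(w)
baseline_loss = float(np.mean(y ** 2))          # loss of the all-zero starting point
qat_loss = float(np.mean((X @ w_q - y) ** 2))
print(f"loss at init: {baseline_loss:.3f}, loss with 4-bit weights after training: {qat_loss:.3f}")
```

The key design point is that the optimizer keeps updating the full-precision latent weights, so small gradient steps accumulate even when a single step is too small to move a weight to the next 4-bit level.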
Gemma 3 QAT (Google DeepMind): In 2025, Google DeepMind released QAT variants of the Gemma 3 model family targeting INT4 inference. According to the official Gemma 3 model cards and Google’s blog post on the release, the QAT variants demonstrate substantially better benchmark performance at INT4 than PTQ INT4 versions of the same model, in some evaluations approaching the quality of an INT8 model. The models are available on Hugging Face and are compatible with llama.cpp GGUF export toolchains. (Exact benchmark deltas should be verified against the current model cards, as Google may update them; treat published numbers as valid at date of release.)
QAT is more compute-intensive to produce than PTQ — it requires a training run, not just a calibration pass — which is why most publicly available QAT models come from organisations with significant training budgets. For end users, QAT models are used identically to their PTQ counterparts: load into llama.cpp, Ollama, vLLM, or another inference engine and run.
Practical implication: If a QAT INT4 version of your chosen model exists, prefer it over a PTQ INT4 version. The VRAM requirement is the same; the quality is better.
Tier mapping: which quantization for your setup?
The right quantization tier depends on your GPU VRAM, the model size you want to run, and your quality tolerance. Use this decision map:
Tier 1 — 24 GB+ VRAM (RTX 3090 / 4090 / A5000 / A6000 class)
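Run 7 B models at FP16 with no compromise. A 13 B model at FP16 needs ~26 GB for weights alone, so on a single 24 GB card use INT8 (~13 GB) instead. A 70 B model at INT4 (or Q4_K_M in GGUF) needs roughly 35–42 GB for weights, which exceeds a single 24 GB card, so plan on two GPUs or CPU offloading. For 34 B models, INT4 (~17–20 GB) fits on 24 GB; INT8 (~34 GB) is comfortable only on 48 GB cards such as the A6000.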
Tier 2 — 16 GB VRAM (RTX 4080, RTX 4060 Ti 16 GB, RX 7900 GRE)
7 B at FP16 is tight but possible (~14 GB weights). 13 B at INT4 / Q4_K_M is the practical ceiling (~8 GB weights + overhead). For 34 B models, use llama.cpp with hybrid CPU offloading. Prioritise Q4_K_M or AWQ INT4 for the best quality at this tier.
Tier 3 — 8–12 GB VRAM (RTX 3060/4060/4070, RX 7600 / 7700 XT)
This is the most common consumer GPU tier. 7 B at INT4 / Q4_K_M fits comfortably (~4–5 GB weights). 13 B at INT4 is possible on 12 GB cards. If a QAT INT4 variant of your model exists, use it — quality gain is free. For context-heavy tasks (long documents), watch KV-cache growth; reducing context window helps.
Tier 4 — 4–6 GB VRAM or CPU-only
At 4–6 GB VRAM, a 7 B Q4_K_M model may still fit if you keep context short. For CPU-only inference via llama.cpp, the constraint is system RAM rather than VRAM (llama.cpp memory-maps the GGUF file into RAM). A machine with 16 GB RAM can run a 7 B Q4_K_M model; 32 GB opens up 13 B models. CPU inference is slower (a few tokens per second, versus tens of tokens per second on GPU) but fully functional.
Quick rule: If you do not know where to start, try Q4_K_M in llama.cpp. It is the community default for a reason — reliable quality, broad model support, and it runs on CPU or GPU. Upgrade to Q5_K_M if you have VRAM headroom and want sharper reasoning. If a QAT variant of the model exists, prefer that over plain PTQ at the same bit-depth.
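The tier logic above can be approximated in code: try formats from highest to lowest precision and take the first one whose weights, plus an overhead margin, fit in VRAM. The sketch below uses the indicative bits-per-weight values from the quick-reference table and a 25 % margin; the function name and the margin are assumptions, and the result is a starting point rather than a guarantee.

```python
# (format name, approximate bits per weight), ordered from highest precision down.
FORMATS = [
    ("FP16", 16.0),
    ("INT8", 8.0),
    ("Q5_K_M", 5.5),
    ("Q4_K_M", 4.5),
    ("Q2_K", 2.6),
]


def pick_format(params_billions: float, vram_gb: float, overhead: float = 0.25):
    """Return the highest-precision format that fits, or None if nothing fits.

    overhead is the fractional margin added for KV cache, activations,
    and framework buffers at short context lengths.
    """
    for name, bits in FORMATS:
        needed_gb = params_billions * bits / 8 * (1 + overhead)
        if needed_gb <= vram_gb:
            return name
    return None  # consider a smaller model or CPU offloading


if __name__ == "__main__":
    for params, vram in [(7, 24), (7, 8), (7, 6), (13, 12), (70, 24)]:
        print(f"{params} B on {vram} GB -> {pick_format(params, vram)}")
```

A `None` result means no listed format fits entirely in VRAM, which is the case where llama.cpp's hybrid CPU/GPU offloading becomes the practical route.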
For a tool that calculates exact requirements by model and format, see the local LLM VRAM calculator. For hardware-specific recommendations, see the local AI hardware guide.
Quantization and data sovereignty in Canada
For Canadian organisations evaluating on-premises AI to keep data out of US-jurisdiction cloud providers, quantization is not just an efficiency technique; it is an enabler. An unquantized 70 B model needs ~140 GB for weights at FP16, which means server-grade hardware costing tens of thousands of dollars. The same model at Q4_K_M (roughly 35–42 GB of weights) runs on a workstation with two RTX 4090s (48 GB combined) or a single 48 GB card such as the RTX A6000, hardware that is considerably more accessible for small and mid-sized organisations.
This matters because meaningful sovereignty over AI inference requires actually running the model locally, not just buying a subscription to a “Canadian cloud” that still sends data through US-based APIs. Quantization lowers the hardware floor enough that on-premises deployment is a realistic option for a much broader range of organisations.
For more context on the Canadian regulatory environment and the case for local AI, see the local LLM in Canada guide and the broader digital sovereignty overview. For help scoping an on-premises deployment, visit our AI sovereignty consulting page.
Frequently asked questions
What is the difference between INT4 and INT8 quantization?
INT8 uses 8 bits (1 byte) per weight and cuts VRAM roughly in half vs FP16. INT4 uses 4 bits (0.5 bytes) and cuts VRAM by roughly 75 %. INT8 typically shows very small quality loss; INT4 shows somewhat more, depending on the model and method. INT4 is the more common choice on consumer hardware because INT8 still requires more VRAM than many GPUs have.
Does quantization affect the quality of LLM outputs?
Yes, but the impact varies significantly by format and model. INT8 impact is usually too small to notice in practice. INT4 PTQ impact is visible on difficult reasoning tasks in controlled benchmarks but often imperceptible in everyday chat and coding use. QAT methods (as in Google DeepMind’s Gemma 3 QAT) substantially close the quality gap for INT4. The best approach is to test your target model and quantization format on your actual use case rather than relying solely on aggregate benchmark scores.
What is QAT and why is it different from normal quantization?
Quantization-Aware Training (QAT) simulates quantization error during the training or fine-tuning process, rather than applying it after the fact (Post-Training Quantization, PTQ). Because the model is trained to be tolerant of precision loss, the resulting quantized model retains more quality than a PTQ model at the same bit-depth. Google DeepMind’s Gemma 3 QAT series is a widely cited public example of this approach applied at INT4.
What is GGUF and how is it different from other quantized model formats?
GGUF is a file format created by the llama.cpp project (led by Georgi Gerganov). It bundles model weights, metadata, and tokeniser data into a single portable binary file. Unlike GPTQ or AWQ formats — which require specific GPU-side loaders — GGUF is designed to run efficiently on CPU (using system RAM) as well as GPU, with support for hybrid offloading between the two. It is the dominant format for consumer and hobbyist local inference. GGUF supports many quantization levels within the same format, from Q2_K through Q8_0 and FP16.
What is Q4_K_M and why do people recommend it?
Q4_K_M is a specific GGUF quantization level: 4-bit, using K-quants (a grouping technique that stores per-group scale and minimum values to reduce quantization error), in the “medium” (M) mix, which keeps some of the more sensitive tensors at higher precision than the “small” (S) variant. Community benchmarks consistently show Q4_K_M offers better quality than simpler 4-bit schemes (like Q4_0) at very similar file size. It has become the community default for a good reason: it is well-supported, well-tested, and hits the sweet spot of quality vs VRAM/RAM usage for most consumer hardware.
Can I run quantized models without a GPU?
Yes. llama.cpp and its frontends (Ollama, LM Studio, etc.) run GGUF models entirely on CPU using system RAM. A 7 B Q4_K_M model requires approximately 4–5 GB of RAM for weights alone. On a machine with 16 GB RAM you have comfortable headroom. Generation will be slower than GPU inference — typical speeds on a modern CPU are in the range of a few tokens per second, which is readable but not instant. A GPU dramatically accelerates token generation even if only a portion of layers are offloaded.
Which quantization should I use for a 7 B model on an 8 GB GPU?
Q4_K_M (GGUF) is the standard starting point. At roughly 4–5 GB for weights, it leaves headroom for KV cache and activations on an 8 GB card at moderate context lengths. If a QAT INT4 GGUF version of your model exists, prefer that. If you are using Python-based tooling (Hugging Face Transformers), AWQ INT4 is a strong alternative. INT8 will likely exceed 8 GB for a 7 B model and is not recommended at this VRAM level.
Does the model size matter more than quantization format?
Generally, a larger model at lower precision outperforms a smaller model at higher precision, up to a point. A 13 B model at INT4 typically produces better outputs than a 7 B model at FP16 on most tasks, provided both fit in available VRAM. The practical strategy is: use the largest model that fits comfortably, then choose the highest-quality quantization format the remaining VRAM allows. Use the VRAM calculator to find the boundary for your specific hardware.
Is quantization the same as pruning or distillation?
No. Quantization reduces the numerical precision of existing weights. Pruning removes weights (or neurons/attention heads) that are deemed unimportant, producing a sparser model. Distillation trains a smaller “student” model to mimic a larger “teacher” model. All three are model compression techniques, and they can be combined (a model can be distilled, then quantized). GGUF and GPTQ quantization do not change the model’s architecture or parameter count — they only change how precisely each weight is stored.
