GGUF Quantization Quality — Bits-per-Weight & Which Quant to Use
Which GGUF quant should you run? Every GGUF LLM quantization type by exact bits-per-weight, quant family and quality tier, so you can pick one that both fits your VRAM and stays accurate. Free CSV/JSON + REST under CC BY 4.0.
Quick answer
GGUF quantization shrinks a local LLM by storing each weight in fewer bits; the trade-off is file size and VRAM against output quality. This reference lists 19 GGUF quant types by exact bits-per-weight (bpw), by family (full-precision floats, K-quants, importance-matrix I-quants, and legacy round-to-nearest types) and by quality tier, so you can pick one that both fits your VRAM and stays accurate. Rule of thumb: Q4_K_M (~4.8 bpw) is the recommended default; use Q5_K_M or Q6_K when you have VRAM to spare; Q8_0 is effectively lossless; below ~3 bpw, prefer the importance-matrix I-quants over Q2_K, and use them only on large models.
Q4_K_M is the sweet spot for most local models. Below ~3 bpw, the importance-matrix I-quants (IQ3 / IQ2) keep more quality per bit and should be reserved for large models; Q6_K and Q8_0 are near-lossless when VRAM allows. Pair this with D-Central's GPU/model VRAM data to confirm that a quant both fits and stays accurate.
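
The size side of that trade-off is simple arithmetic: weight bytes are roughly parameter count times bits-per-weight divided by 8. The sketch below applies it to a subset of the bpw values from the table and picks the highest-bpw quant that fits a VRAM budget. The function names and the `overhead_gb` allowance for KV cache and runtime buffers are hypothetical placeholders, not figures from this reference; real overhead depends on context length and backend.

```python
# Bits-per-weight values copied from the table in this reference.
QUANT_BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q5_K": 5.5,
    "Q4_K_M": 4.8,   # approximate effective bpw of the Q4_K/Q6_K mix
    "IQ4_XS": 4.25,
    "IQ3_S": 3.44,
    "IQ3_XXS": 3.06,
    "Q2_K": 2.625,
    "IQ2_XXS": 2.06,
}


def weight_gb(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9


def best_fit(n_params: float, vram_gb: float, overhead_gb: float = 1.5):
    """Return the highest-bpw quant whose weights fit, or None.

    overhead_gb is an assumed allowance for KV cache and buffers,
    not a measured value.
    """
    budget = vram_gb - overhead_gb
    fitting = [
        (bpw, name)
        for name, bpw in QUANT_BPW.items()
        if weight_gb(n_params, bpw) <= budget
    ]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    print(round(weight_gb(7e9, 4.8), 2))  # 4.2 (GB of weights for a 7B model)
    print(best_fit(7e9, 8))               # Q6_K
    print(best_fit(70e9, 24))             # IQ2_XXS
```

Highest bpw is used here as a stand-in for best quality, which the table's quality tiers only roughly support; treat the result as a starting point and check it against the VRAM fit data.
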
| Quant type | Family | Bits/weight | Quality | Notes |
|---|---|---|---|---|
| F32 | Floating point | 32 | Reference | Full single precision; the unquantized training baseline. Rarely used for local inference (huge files). |
| F16 | Floating point | 16 | Reference | Half precision; the near-lossless inference baseline that quants are measured against. |
| BF16 | Floating point | 16 | Reference | Brain-float 16; same size as F16 with a wider exponent range; a common training/inference format. |
| Q8_0 | Legacy | 8.5 | Near-lossless | 8-bit round-to-nearest; virtually indistinguishable from F16. Large files; a safe maximum-quality quant. |
| Q6_K | K-quant | 6.5625 | Excellent | 6-bit K-quant; near-indistinguishable from F16 in most evaluations. The top choice when VRAM allows. |
| Q5_K | K-quant | 5.5 | Very good | Base 5-bit K-quant. Q5_K_M (a Q5_K/Q6_K tensor mix) is a high-quality pick for modest extra size. |
| Q4_K | K-quant | 4.5 | Good | Base 4-bit K-quant. Q4_K_M (a Q4_K/Q6_K mix, ~4.8 bpw effective) is the recommended default for most local models. |
| Q4_0 | Legacy | 4.5 | Medium | Legacy round-to-nearest 4-bit; superseded by Q4_K_M, which is higher quality at similar size. |
| IQ4_XS | I-quant | 4.25 | Good | Importance-matrix 4-bit; excellent quality-per-bit, often matching Q4_K_S at a smaller size. Slower on some CPUs. |
| IQ4_NL | I-quant | 4.5 | Good | Importance-matrix 4-bit non-linear; uses non-uniformly spaced quantization levels and is slightly larger than IQ4_XS. |
| IQ3_S | I-quant | 3.44 | Medium | Importance-matrix 3-bit; better quality-per-bit than Q3_K at a comparable size. |
| Q3_K | K-quant | 3.4375 | Medium-low | Base 3-bit K-quant (Q3_K_S/M/L are tensor mixes around this). Visible quality loss on smaller models. |
| IQ3_XXS | I-quant | 3.06 | Low-medium | Importance-matrix 3-bit, very small; best reserved for larger models. |
| Q2_K | K-quant | 2.625 | Low | Smallest K-quant; noticeable quality loss. Use only for tight VRAM on large (30B+) models. |
| IQ2_S | I-quant | 2.5 | Low | Importance-matrix 2-bit; preserves more quality per bit than Q2_K. Large models only. |
| IQ2_XS | I-quant | 2.31 | Very low | Importance-matrix 2-bit, extra small; viable only on very large models. |
| IQ2_XXS | I-quant | 2.06 | Very low | Importance-matrix 2-bit, smallest practical 2-bit; large models only, with real quality loss. |
| IQ1_M | I-quant | 1.75 | Lowest | Importance-matrix 1-bit; extreme compression, only usable on the largest (70B+) models. |
| IQ1_S | I-quant | 1.56 | Lowest | Importance-matrix 1-bit, smallest; experimental, heavy quality loss, 70B+ only. |
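
The fractional bits-per-weight values for the legacy and K-quant rows follow from how weights are packed into blocks: each block stores the quantized weights plus scale metadata, and bpw is block bits divided by weights per block. The sketch below reproduces the table values from block sizes as I understand llama.cpp's block layouts; treat the byte breakdowns as assumptions to check against the llama.cpp source. The 85/15 split used for the mixed quant is purely illustrative and is not the actual Q4_K_M tensor recipe.

```python
from fractions import Fraction


def block_bpw(block_bytes: int, weights_per_block: int) -> Fraction:
    """Exact bits-per-weight for a block of packed weights plus metadata."""
    return Fraction(block_bytes * 8, weights_per_block)


# (bytes per block, weights per block); byte breakdowns are assumptions
# based on llama.cpp's block structs, not taken from this reference.
BLOCKS = {
    "Q8_0": (2 + 32, 32),              # fp16 scale + 32 int8 weights
    "Q4_0": (2 + 16, 32),              # fp16 scale + 32 packed 4-bit weights
    "Q6_K": (128 + 64 + 16 + 2, 256),  # low bits, high bits, sub-block scales, fp16 scale
    "Q5_K": (4 + 12 + 32 + 128, 256),  # scale/min, packed sub-block scales, high bits, low bits
    "Q4_K": (4 + 12 + 128, 256),       # scale/min, packed sub-block scales, 4-bit weights
    "Q3_K": (32 + 64 + 12 + 2, 256),   # high-bit mask, low bits, packed scales, fp16 scale
    "Q2_K": (16 + 64 + 4, 256),        # sub-block scales/mins, 2-bit weights, scale/min
}


def mix_bpw(parts):
    """Weighted-average bpw for a tensor mix: parts is [(share_of_weights, bpw), ...]."""
    total = sum(share for share, _ in parts)
    return sum(share * bpw for share, bpw in parts) / total


if __name__ == "__main__":
    for name, (nbytes, nweights) in BLOCKS.items():
        print(name, float(block_bpw(nbytes, nweights)))

    # Hypothetical split: 85% of weights in Q4_K, 15% in Q6_K.
    q4 = block_bpw(*BLOCKS["Q4_K"])
    q6 = block_bpw(*BLOCKS["Q6_K"])
    print(float(mix_bpw([(Fraction(85, 100), q4), (Fraction(15, 100), q6)])))  # 4.809375
```

This also shows why Q4_K and Q4_0 share 4.5 bpw despite different quality: the K-quant spends its metadata bits on per-sub-block scales inside a 256-weight super-block, while Q4_0 spends them on one scale per 32 weights.
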
Source: the GGUF quantization-type descriptions in the HuggingFace Hub docs and llama.cpp’s quant-descriptions. Related glossary: quantization, GGUF, perplexity. Tools: GPU/model VRAM fit, VRAM calculator, local-LLM model database, inference-cost calculator.
Last reviewed June 19, 2026.
