

GGUF Quantization Quality — Bits-per-Weight & Which Quant to Use

Which GGUF quant should you run? Every GGUF LLM quantization type by exact bits-per-weight, quant family and quality tier, so you can pick one that both fits your VRAM and stays accurate. Free CSV/JSON + REST under CC BY 4.0.

Quick answer

GGUF quantization shrinks a local LLM by storing each weight in fewer bits, trading file size and VRAM against output quality. This reference lists 19 GGUF quant types by exact bits-per-weight (bpw), family (full-precision floats, K-quants, importance-matrix I-quants, and legacy round-to-nearest types) and quality tier, so you can pick one that both fits your VRAM and stays accurate. Rule of thumb: Q4_K_M (~4.8 bpw) is the recommended default; use Q5_K_M or Q6_K when you have VRAM to spare; Q8_0 is near-lossless; below ~3 bpw, prefer the importance-matrix I-quants, and only on large models.
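To make the size trade-off concrete, the weight footprint can be approximated as parameter count times bits-per-weight, divided by 8 to get bytes. The sketch below is only an estimate: the function name and the 8-billion-parameter example model are illustrative assumptions, and it ignores the KV cache, activations, and the fact that real GGUF files mix tensor types.

```python
def estimate_weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB


# Hypothetical 8-billion-parameter model at three quant levels from the table.
for name, bpw in [("Q4_K_M", 4.8), ("Q6_K", 6.5625), ("Q8_0", 8.5)]:
    print(f"{name}: about {estimate_weights_gib(8e9, bpw):.1f} GiB of weights")
```

Leave headroom on top of this figure for context (KV cache) and runtime buffers when checking a quant against a GPU.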

Q4_K_M is the sweet spot for most local models. Below ~3 bpw, the importance-matrix I-quants (IQ3 / IQ2) keep more quality per bit and should be reserved for large models; Q6_K and Q8_0 are near-lossless when VRAM allows. Pair this table with D-Central's GPU/model VRAM data to confirm that the quant you pick for accuracy also fits your card.
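The rule of thumb can be sketched as a simple search: walk the quants from highest quality down and take the first whose weights fit the VRAM budget. Everything named here (`PREFERENCE`, `pick_quant`, the 1.5 GiB default reserve for KV cache and buffers) is an assumption for illustration, not part of the published dataset; the bpw figures are the ones listed on this page.

```python
# Preference order, highest quality first, following the rule of thumb.
# bpw values come from this page; Q4_K_M uses the ~4.8 effective figure.
PREFERENCE = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.5625),
    ("Q5_K", 5.5),
    ("Q4_K_M", 4.8),
    ("IQ4_XS", 4.25),
    ("IQ3_S", 3.44),
    ("IQ3_XXS", 3.06),
    ("IQ2_S", 2.5),
    ("IQ2_XS", 2.31),
    ("IQ2_XXS", 2.06),
]


def pick_quant(n_params, vram_gib, reserve_gib=1.5):
    """Return the first (highest-quality) quant whose weights fit, else None.

    reserve_gib is an assumed allowance for KV cache and runtime buffers.
    """
    budget_bytes = (vram_gib - reserve_gib) * 2**30
    for name, bpw in PREFERENCE:
        weight_bytes = n_params * bpw / 8
        if weight_bytes <= budget_bytes:
            return name
    return None


print(pick_quant(8e9, 8))    # 8B-parameter model on an 8 GiB card
print(pick_quant(70e9, 24))  # 70B-parameter model on a 24 GiB card
```

A result below ~3 bpw on a small model is a signal to choose a smaller model instead, since the low-bit I-quants are recommended only for large models.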

Download: CSV | JSON | REST API

Quant type | Family | Bits/weight | Quality | Notes
F32 | Floating point | 32 | Reference | Full single precision; the unquantized training baseline. Rarely used for local inference (huge files).
F16 | Floating point | 16 | Reference | Half precision; the near-lossless inference baseline that quants are measured against.
BF16 | Floating point | 16 | Reference | Brain-float 16; same size as F16 with a wider exponent range; a common training/inference format.
Q8_0 | Legacy | 8.5 | Near-lossless | 8-bit round-to-nearest; virtually indistinguishable from F16. Large files; a safe maximum-quality quant.
Q6_K | K-quant | 6.5625 | Excellent | 6-bit K-quant; near-indistinguishable from F16 in most evaluations. The top choice when VRAM allows.
Q5_K | K-quant | 5.5 | Very good | Base 5-bit K-quant. Q5_K_M (a Q5_K/Q6_K tensor mix) is a high-quality pick for modest extra size.
Q4_K | K-quant | 4.5 | Good | Base 4-bit K-quant. Q4_K_M (a Q4_K/Q6_K mix, ~4.8 bpw effective) is the recommended default for most local models.
Q4_0 | Legacy | 4.5 | Medium | Legacy round-to-nearest 4-bit; superseded by Q4_K_M, which is higher quality at similar size.
IQ4_NL | I-quant | 4.5 | Good | Importance-matrix 4-bit non-linear; uses non-uniformly spaced quantization levels. Slightly larger than IQ4_XS.
IQ4_XS | I-quant | 4.25 | Good | Importance-matrix 4-bit; excellent quality-per-bit, often matching Q4_K_S at a smaller size. Slower on some CPUs.
IQ3_S | I-quant | 3.44 | Medium | Importance-matrix 3-bit; better quality-per-bit than Q3_K at a comparable size.
Q3_K | K-quant | 3.4375 | Medium-low | Base 3-bit K-quant (Q3_K_S/M/L are tensor mixes around this). Visible quality loss on smaller models.
IQ3_XXS | I-quant | 3.06 | Low-medium | Importance-matrix 3-bit, very small; best reserved for larger models.
Q2_K | K-quant | 2.625 | Low | Smallest K-quant; noticeable quality loss. Use only for tight VRAM on large (30B+) models.
IQ2_S | I-quant | 2.5 | Low | Importance-matrix 2-bit; preserves more quality per bit than Q2_K. Large models only.
IQ2_XS | I-quant | 2.31 | Very low | Importance-matrix 2-bit, extra small; viable only on very large models.
IQ2_XXS | I-quant | 2.06 | Very low | Importance-matrix 2-bit, smallest practical 2-bit; large models only, with real quality loss.
IQ1_M | I-quant | 1.75 | Lowest | Importance-matrix 1-bit; extreme compression, only usable on the largest (70B+) models.
IQ1_S | I-quant | 1.56 | Lowest | Importance-matrix 1-bit, smallest; experimental, heavy quality loss, 70B+ only.
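The fractional bits-per-weight values come from per-block overhead: each block stores the quantized weights plus scale (and sometimes minimum) metadata, and bpw is total block bits divided by weights per block. The sketch below reproduces several table values from block layouts as I understand them from llama.cpp's ggml block definitions; treat the byte counts as an illustration and check them against the llama.cpp source before relying on them. The names `BLOCKS` and `bits_per_weight` are hypothetical.

```python
# (weights per block, bytes per block), with the byte breakdown in comments.
BLOCKS = {
    "Q8_0": (32, 2 + 32),               # fp16 scale + 32 int8 weights
    "Q4_0": (32, 2 + 16),               # fp16 scale + 32 packed 4-bit weights
    "Q6_K": (256, 128 + 64 + 16 + 2),   # low bits + high bits + scales + fp16 scale
    "Q5_K": (256, 4 + 12 + 32 + 128),   # scale/min + packed scales + high bits + low bits
    "Q4_K": (256, 4 + 12 + 128),        # scale/min + packed scales + 4-bit weights
    "Q3_K": (256, 32 + 64 + 12 + 2),    # high-bit mask + low bits + scales + fp16 scale
    "Q2_K": (256, 16 + 64 + 4),         # scales + 2-bit weights + scale/min
}


def bits_per_weight(weights_per_block: int, bytes_per_block: int) -> float:
    """Total bits stored per block divided by the weights that block holds."""
    return bytes_per_block * 8 / weights_per_block


for name, layout in BLOCKS.items():
    print(f"{name}: {bits_per_weight(*layout)} bpw")
```

This is why Q8_0 lists 8.5 rather than 8 bits, and why Q4_0 and Q4_K share the same 4.5 bpw despite different block structures.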

Source: the GGUF quantization-type descriptions in the HuggingFace Hub docs and llama.cpp’s quant-descriptions. Related glossary: quantization, GGUF, perplexity. Tools: GPU/model VRAM fit, VRAM calculator, local-LLM model database, inference-cost calculator.