Definition
K-quants are a family of quantization formats used by llama.cpp and the wider GGUF ecosystem, with names such as Q4_K_M, Q5_K_S, and Q6_K. The "K" marks a super-block structure that allocates bits more efficiently than the older legacy formats such as Q4_0. Rather than describing every small block with its own full-precision scale, K-quants group weights into super-blocks (typically 256 values) subdivided into smaller blocks, each carrying its own quantized scale and, in some variants, a quantized minimum. This hierarchical layout keeps per-block metadata small while still letting the scale follow the local range of the weights.
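The two-level idea can be sketched in a few lines of Python. This is an illustration only, not llama.cpp's actual bit packing or scale search: the function names are hypothetical, and the sub-block size of 32 and the 6-bit scale and minimum codes are assumptions chosen for the example.

```python
import random

SUPER_BLOCK = 256  # weights per super-block
SUB_BLOCK = 32     # weights per sub-block (an assumption for this sketch)


def quantize_super_block(weights, bits=4, meta_bits=6):
    """Two-level quantization: each weight w is approximated by d*qs*q - dmin*qm."""
    assert len(weights) == SUPER_BLOCK
    qmax = (1 << bits) - 1            # largest weight code, 15 for 4 bits
    meta_max = (1 << meta_bits) - 1   # largest scale/min code, 63 for 6 bits
    blocks = [weights[i:i + SUB_BLOCK] for i in range(0, SUPER_BLOCK, SUB_BLOCK)]

    # Level 1: a float offset and scale for every sub-block.
    offsets = [max(0.0, -min(b)) for b in blocks]
    scales = [(max(b) + off) / qmax for b, off in zip(blocks, offsets)]

    # Level 2: quantize those scales and offsets against two super-block floats.
    d = max(scales) / meta_max or 1.0
    dmin = max(offsets) / meta_max or 1.0
    q_scales = [round(s / d) for s in scales]
    q_mins = [round(o / dmin) for o in offsets]

    # Quantize the weights using the *reconstructed* scale and offset.
    codes = []
    for block, qs, qm in zip(blocks, q_scales, q_mins):
        scale, offset = d * qs, dmin * qm
        for w in block:
            q = round((w + offset) / scale) if scale > 0 else 0
            codes.append(min(qmax, max(0, q)))
    return d, dmin, q_scales, q_mins, codes


def dequantize_super_block(d, dmin, q_scales, q_mins, codes):
    """Rebuild approximate floats from the two-level representation."""
    out = []
    for i, q in enumerate(codes):
        b = i // SUB_BLOCK  # index of the sub-block this weight belongs to
        out.append(d * q_scales[b] * q - dmin * q_mins[b])
    return out


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(SUPER_BLOCK)]
restored = dequantize_super_block(*quantize_super_block(weights))
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {max_error:.4f}")
```

Only two floats are stored per super-block in this sketch; everything else is small integers, which is where the metadata saving comes from.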
Mixed precision and the size suffix
Several K-quant variants are mixed precision: most weights are stored at the nominal bit-width, while the most sensitive tensors (such as parts of the attention and feed-forward layers) are kept at a slightly higher bit-width. The trailing letter indicates the tradeoff, with S, M, and L denoting small, medium, and large within a given level. Q4_K_M, for example, keeps the bulk of weights at 4 bits but stores selected sensitive tensors at 6 bits, giving it a noticeably better quality-to-size ratio than the legacy 4-bit Q4_0.
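The storage cost of such a mix is simple arithmetic. The sketch below assumes per-super-block metadata sizes that reflect one reading of llama.cpp's 4-bit and 6-bit K block layouts (16 and 18 bytes respectively); treat them as assumptions. The fraction of weights kept at 6 bits is purely hypothetical and differs from model to model.

```python
def bits_per_weight(weights_per_block, payload_bits, metadata_bytes):
    """Average bits per weight once block metadata is included."""
    total_bits = weights_per_block * payload_bits + 8 * metadata_bytes
    return total_bits / weights_per_block


# Assumed layouts for a 256-weight super-block:
#   4-bit K block: 12 bytes of packed 6-bit scales/mins + two fp16 values
#   6-bit K block: 16 one-byte scales + one fp16 value
Q4_K_BPW = bits_per_weight(256, 4, 12 + 4)
Q6_K_BPW = bits_per_weight(256, 6, 16 + 2)


def mixed_bpw(high_fraction, low=Q4_K_BPW, high=Q6_K_BPW):
    """Effective bits per weight when a fraction of weights uses the higher level."""
    return high_fraction * high + (1.0 - high_fraction) * low


print(Q4_K_BPW, Q6_K_BPW)
print(mixed_bpw(0.15))  # 0.15 is a hypothetical share of 6-bit weights
```

Under these assumptions, moving a small share of weights to 6 bits raises the average only modestly, which is why the mixed variants cost little extra space.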
Why local operators care
K-quants are the default choice for most people running GGUF models with llama.cpp, because the higher levels stay close to full-precision quality at a fraction of the memory. For self-hosters balancing VRAM, RAM, and disk against output quality, picking the right K-quant level is often the single most consequential decision when setting up local inference.
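A rough way to reason about that choice is to estimate the size of the weights at each level and keep the largest one that fits. In the sketch below the bits-per-weight table holds approximate, illustrative figures only (real values vary by model and llama.cpp version), the helper names are hypothetical, and the estimate ignores the KV cache and other runtime overhead.

```python
def weights_gib(n_params, bpw):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bpw / 8 / 2**30


# Illustrative, approximate bits-per-weight figures (assumptions, not measurements).
ILLUSTRATIVE_BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.56}


def largest_fitting(n_params, budget_gib, table=ILLUSTRATIVE_BPW):
    """Return the highest-bpw level whose weights fit the budget, or None."""
    fitting = {
        name: bpw
        for name, bpw in table.items()
        if weights_gib(n_params, bpw) <= budget_gib
    }
    return max(fitting, key=fitting.get) if fitting else None


# Example: a 7-billion-parameter model with 5 GiB set aside for weights.
for name, bpw in ILLUSTRATIVE_BPW.items():
    print(f"{name}: {weights_gib(7e9, bpw):.2f} GiB")
print("pick:", largest_fitting(7e9, 5.0))
```

In practice the budget should leave headroom for context length and the runtime itself, so treat the result as an upper bound on what will fit.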
K-quants are stored in GGUF files and produced by llama.cpp's quantization tooling; their quality is often improved by supplying an importance matrix during quantization.
