K-quants (llama.cpp)

K-quants are a family of quantization formats in the llama.cpp / GGUF ecosystem, denoted by names such as Q4_K_M, Q5_K_S, and Q6_K. The "K" signals a super-block structure that allocates bits more intelligently than the older legacy formats (Q4_0, Q5_1, and kin). Legacy formats attach a 16-bit scale (and, in the _1 variants, a 16-bit minimum) to every block of 32 weights. K-quants instead group weights into super-blocks of 256 values, subdivided into smaller blocks of 16 or 32 weights. Each sub-block carries its own scale, and in several variants (Q2_K, Q4_K, Q5_K) its own minimum, but these are themselves quantized to a few bits and rescaled by a higher-precision factor stored once per super-block. This hierarchical layout spends metadata precision where it does the most good, which is why a K-quant almost always beats a legacy quant of the same size on quality.
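The two-level idea can be illustrated with a small sketch. This is a simplified model of a 4-bit super-block with per-sub-block scale and minimum, not llama.cpp's actual quantizer: the function names, the dictionary layout, and the plain min/max fitting are assumptions made for illustration, and no bit-packing is attempted.

```python
import random

SUPER_BLOCK = 256  # weights per super-block
SUB_BLOCK = 32     # weights per sub-block, so 8 sub-blocks per super-block
CODE_MAX = 15      # 4-bit weight codes: 0..15
META_MAX = 63      # 6-bit codes for each sub-block scale and minimum


def quantize_super_block(weights):
    """Quantize 256 floats into 4-bit codes plus two-level metadata (illustrative)."""
    assert len(weights) == SUPER_BLOCK
    subs = [weights[i:i + SUB_BLOCK] for i in range(0, SUPER_BLOCK, SUB_BLOCK)]

    # Ideal per-sub-block parameters so that w ~= scale * code - offset.
    offsets = [-min(min(sub), 0.0) for sub in subs]  # always >= 0
    scales = [(max(sub) + off) / CODE_MAX for sub, off in zip(subs, offsets)]

    # Super-block level: one float factor for the scales, one for the offsets.
    d = max(scales) / META_MAX or 1.0
    dmin = max(offsets) / META_MAX or 1.0
    scale_codes = [round(s / d) for s in scales]     # 6-bit integers
    min_codes = [round(o / dmin) for o in offsets]   # 6-bit integers

    codes = []
    for sub, sc, mc in zip(subs, scale_codes, min_codes):
        eff_scale, eff_offset = d * sc, dmin * mc    # what the decoder will see
        for w in sub:
            q = round((w + eff_offset) / eff_scale) if eff_scale else 0
            codes.append(max(0, min(CODE_MAX, q)))
    return {"d": d, "dmin": dmin, "scales": scale_codes,
            "mins": min_codes, "codes": codes}


def dequantize_super_block(block):
    """Rebuild approximate weights: w ~= d * scale_code * q - dmin * min_code."""
    out = []
    for i, q in enumerate(block["codes"]):
        j = i // SUB_BLOCK
        out.append(block["d"] * block["scales"][j] * q
                   - block["dmin"] * block["mins"][j])
    return out


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(SUPER_BLOCK)]
block = quantize_super_block(weights)
restored = dequantize_super_block(block)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {max_error:.4f}")
```

The point of the sketch is the metadata cost: each sub-block gets its own scale and minimum, yet only two full-precision numbers (`d` and `dmin`) are stored per 256 weights.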

Decoding the name

A K-quant name packs three facts. The leading number is the nominal bit-width: Q4_K stores most weights around 4 bits per weight, Q5_K around 5, Q6_K around 6. The K marks the super-block family. The trailing letter, where present (S, M, or L for small, medium, large), selects how generously the sensitive tensors are treated; Q6_K carries no such suffix. The suffixed variants are mixed precision: the bulk of the model sits at the nominal width while tensors that damage quality most when squeezed (attention and embedding layers, notably) are kept a step higher. Q4_K_M, for example, holds most weights at 4 bits but stores selected sensitive tensors at 6 bits, giving it a markedly better quality-to-size ratio than the legacy Q4_0 it superseded. Q6_K sits close enough to the 8-bit Q8_0 in output quality that many operators treat it as the "effectively lossless" tier for chat use.
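The word "around" matters because the metadata adds a fraction of a bit to every weight. The arithmetic below works that out for three types. The layout figures (sub-block counts, metadata widths) reflect the llama.cpp block definitions as generally documented and are best treated as assumptions to check against the source; the helper name is hypothetical.

```python
def bits_per_weight(code_bits, sub_blocks, meta_bits_per_sub,
                    super_scale_bits, super_block=256):
    """Effective bits per weight for one super-block, metadata included."""
    total_bits = (super_block * code_bits
                  + sub_blocks * meta_bits_per_sub
                  + super_scale_bits)
    return total_bits / super_block


# Q4_K (assumed layout): 8 sub-blocks, 6-bit scale + 6-bit min each,
# plus two fp16 super-block factors.
q4_k = bits_per_weight(code_bits=4, sub_blocks=8,
                       meta_bits_per_sub=12, super_scale_bits=32)

# Q5_K (assumed layout): same metadata as Q4_K with 5-bit codes.
q5_k = bits_per_weight(code_bits=5, sub_blocks=8,
                       meta_bits_per_sub=12, super_scale_bits=32)

# Q6_K (assumed layout): 16 sub-blocks, 8-bit scale each, one fp16 factor.
q6_k = bits_per_weight(code_bits=6, sub_blocks=16,
                       meta_bits_per_sub=8, super_scale_bits=16)

print(q4_k, q5_k, q6_k)  # 4.5 5.5 6.5625
```

A mixed-precision file such as Q4_K_M lands somewhat above the Q4_K figure, since some of its tensors are stored in a higher-precision type.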

Choosing a level in practice

The working heuristic in the local-inference community: Q4_K_M is the default sweet spot, the smallest level whose quality loss is hard to notice in ordinary use; Q5_K_M buys a little more fidelity if memory allows; Q6_K is for when you want near-full quality; and levels below Q4 (Q3_K, Q2_K) are compromises accepted only to fit a model that otherwise would not load at all. A useful rule of thumb runs the other way, too: a larger model at Q4 usually beats a smaller model at Q8 for the same memory footprint, because parameter count matters more than the last bits of per-weight precision. Your ceiling is set by VRAM (plus system RAM for any layers left on the CPU), and your throughput in tokens per second falls as the file grows, since generation speed is dominated by how many bytes of weights must stream through memory per token.
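Both rules of thumb reduce to back-of-the-envelope arithmetic, sketched below. The function names are hypothetical, the 4.85 bits-per-weight average for Q4_K_M is an assumed ballpark (the real figure varies by model), the 400 GB/s bandwidth is a made-up example, and the estimate ignores the KV cache and runtime overhead.

```python
def estimate_weight_bytes(params, bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return params * bits_per_weight / 8


def decode_speed_upper_bound(weight_bytes, bandwidth_bytes_per_s):
    """Rough ceiling on tokens/s if every weight is read once per token."""
    return bandwidth_bytes_per_s / weight_bytes


# Similar footprint, different models (bits-per-weight values are assumptions):
big_q4 = estimate_weight_bytes(13e9, 4.85)   # 13B model at Q4_K_M, about 7.9 GB
small_q8 = estimate_weight_bytes(7e9, 8.5)   # 7B model at Q8_0, about 7.4 GB

bandwidth = 400e9  # hypothetical memory bandwidth in bytes per second
for name, size in [("13B Q4_K_M", big_q4), ("7B Q8_0", small_q8)]:
    ceiling = decode_speed_upper_bound(size, bandwidth)
    print(f"{name}: {size / 1e9:.1f} GB, at most ~{ceiling:.0f} tokens/s")
```

Real throughput sits below this ceiling, but the proportionality is the useful part: a file twice as large streams twice as many bytes per token.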

Why local operators care

K-quants are the default choice for most people running GGUF models, the format loaded by llama.cpp and by front-ends such as Ollama, because they deliver close to full-precision quality at a fraction of the memory. For a self-hoster balancing hardware budget against output quality, picking the K-quant level is often the single most consequential decision in a local deployment. Quality at low bit-widths is further improved when the quantizer is guided by an importance matrix, which measures which weights matter most on real text before deciding where to spend precision.

The family has since been joined by the newer IQ formats (IQ2, IQ3, and kin), importance-matrix-driven quantization types aimed at the very low bit-widths where classic K-quants degrade sharply; they squeeze surprisingly usable output from 2–3 bit budgets at some cost in decoding speed on CPU. K-quants remain the mainstream tier: better supported, faster on most hardware, and thoroughly battle-tested across the model zoo. Since every quantization is a lossy snapshot of the original weights, serious operators spot-check a new quant against a few of their own real prompts rather than trusting the level's reputation alone; a quant that benchmarks fine can still stumble on your specific domain vocabulary.

K-quants live inside the GGUF container; the wider landscape of methods is surveyed under LLM quantization.
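The importance-matrix idea mentioned above can be sketched as a weighted fitting problem: instead of minimizing plain squared error when choosing a block's scale, weight each error by how much that weight matters. The code below is a toy illustration under that assumption, not llama.cpp's actual procedure; the function names, the symmetric 4-bit code range, the grid search, and the importance values are all hypothetical.

```python
def quantize(block, scale, q_min=-8, q_max=7):
    """Round each weight to the nearest integer code at the given scale."""
    return [max(q_min, min(q_max, round(w / scale))) for w in block]


def weighted_error(block, importance, scale):
    """Importance-weighted squared reconstruction error for one scale."""
    codes = quantize(block, scale)
    return sum(imp * (w - scale * q) ** 2
               for w, imp, q in zip(block, importance, codes))


def fit_scale(block, importance, steps=64):
    """Grid-search scales from 0.5x to 1.5x of the max-abs baseline."""
    base = max(abs(w) for w in block) / 7 or 1.0
    candidates = [base * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(block, importance, s))


block = [0.9, -0.05, 0.31, -0.62, 0.12, 0.47, -0.28, 0.03]
# Hypothetical importance values; in practice they would come from
# activation statistics gathered on calibration text.
importance = [0.1, 4.0, 0.2, 0.1, 3.0, 0.2, 0.1, 5.0]

plain_scale = fit_scale(block, [1.0] * len(block))  # treats all weights alike
aware_scale = fit_scale(block, importance)          # favors important weights

print("plain:", weighted_error(block, importance, plain_scale))
print("aware:", weighted_error(block, importance, aware_scale))
```

Because both fits search the same candidate scales, the importance-aware choice can never score worse than the plain one on the weighted metric; how much it helps depends on how unevenly the importance is distributed.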
