Quantization (LLM)


Quantization is the process of reducing the numerical precision used to store a large language model's weights, shrinking the model so it fits in less memory and runs faster. A model trained at 16-bit precision can be quantized to 8-bit, 4-bit, or lower, cutting its memory footprint by half or more. This is the single most important technique for running capable local models on consumer hardware instead of renting cloud GPUs.

The precision-versus-quality trade-off

Lower precision means each weight is stored with fewer bits. That reduces VRAM use and usually increases generation speed, because token generation is typically limited by how fast the weights can be read from memory, but it can slightly degrade output quality. Modern quantization formats, such as GGUF quant levels (for example Q4_K_M or Q5_K_M) and methods like GPTQ and AWQ, are designed to minimize that loss. In practice, 4-bit and 5-bit quants of a larger model often outperform an unquantized smaller model that uses the same memory, so on a fixed hardware budget, quantizing a larger model frequently gives better real-world results than running a smaller one at full precision.

How quantization actually works

A quantizer does not simply chop bits off each number. Weights are grouped into small blocks, and each block stores its values as low-bit integers plus one or two higher-precision scale factors that map those integers back to real values at inference time. This block-wise scaling is why a "4-bit" GGUF file works out to slightly more than 4 bits per weight on disk. The k-quant family goes further by mixing precisions inside one file: tensors that are most sensitive to error keep more bits (in llama.cpp's Q4_K_M, for example, part of the attention value and feed-forward down-projection tensors are stored at 6 bits), while more forgiving tensors are squeezed harder. Methods like GPTQ and AWQ add a calibration step: they run sample text through the model and choose quantized values that minimize the measured output error rather than the raw rounding error. Importance-matrix (imatrix) GGUF quants apply the same idea to preserve the weights that activations actually exercise.
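The block-wise scheme described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual GGUF, GPTQ, or AWQ algorithm: it assumes symmetric "absmax" rounding, a block size of 32, and one 16-bit scale per block, and every function name here is hypothetical.

```python
import random


def quantize_block(values, bits=4):
    """Quantize one block to signed low-bit integers plus a single scale."""
    qmax = 2 ** (bits - 1) - 1                # 7 for signed 4-bit
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0  # avoid dividing by zero
    # Round to the nearest integer step and clamp to the signed range.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def dequantize_block(ints, scale):
    """Map the stored integers back to approximate real values."""
    return [q * scale for q in ints]


def quantize_tensor(weights, block_size=32, bits=4):
    """Split a flat weight list into blocks and quantize each independently."""
    return [quantize_block(weights[i:i + block_size], bits)
            for i in range(0, len(weights), block_size)]


def effective_bits_per_weight(block_size=32, bits=4, scale_bits=16):
    """Storage cost per weight once the per-block scale is counted."""
    return (block_size * bits + scale_bits) / block_size


if __name__ == "__main__":
    random.seed(0)
    # Illustrative random weights, not taken from any real model.
    weights = [random.gauss(0.0, 0.02) for _ in range(128)]
    blocks = quantize_tensor(weights)
    restored = [w for ints, scale in blocks
                for w in dequantize_block(ints, scale)]
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"worst reconstruction error: {worst:.6f}")
    print(f"bits per weight: {effective_bits_per_weight()}")
```

Under these assumptions, 32 four-bit values plus one 16-bit scale come to 4.5 bits per weight, which is the overhead the paragraph above refers to. Because each block has its own scale, a single outlier weight only coarsens the rounding step inside its own block.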

Practical sizing math

The arithmetic is simple enough to do on a napkin. At 16-bit precision every parameter costs 2 bytes, so a 7-billion-parameter model needs roughly 14 GB for weights alone, which is beyond most consumer GPUs. At 8-bit it drops to about 7 GB; at 4-bit, to roughly 3.5 to 4 GB once the per-block scale factors are counted, which fits comfortably on an 8 GB card with room left for the KV cache and context. That same scaling is what lets a 70B-class model, hopeless at full precision on home hardware, become feasible on a dual-GPU workstation or a high-memory Apple Silicon machine at 4-bit. Always budget extra memory beyond the weight file: the KV cache grows with context length and can add gigabytes on long conversations.
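The napkin math can be written down as a small calculator. This is a rough sketch: it counts 1 GB as 10^9 bytes, ignores runtime overhead, and the model shape used for the KV cache example (32 layers, 32 key/value heads, head dimension 128, 16-bit cache) is an illustrative assumption, as are the function names.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB: parameters times bits, divided by 8."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size in GB for one sequence."""
    # Factor of 2: both keys and values are cached for every layer and token.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_value)
    return total_bytes / 1e9


def fits(vram_gb, params_billion, bits_per_weight, kv_gb, overhead_gb=1.0):
    """True if weights, KV cache, and an assumed overhead fit in VRAM."""
    needed = weight_gb(params_billion, bits_per_weight) + kv_gb + overhead_gb
    return needed <= vram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4.5):
        print(f"7B at {bits} bits/weight: {weight_gb(7, bits):.2f} GB")
    # Assumed shape for a 7B-class model with a 4096-token context.
    kv = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)
    print(f"KV cache at 4096 tokens: {kv:.2f} GB")
    print("fits on 8 GB at 4.5 bits:", fits(8, 7, 4.5, kv))
    print("fits on 8 GB at 16 bits:", fits(8, 7, 16, kv))
```

With these assumed numbers the KV cache alone is a little over 2 GB at a 4096-token context, which is why the weight file size should never be treated as the whole memory requirement.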

Where the quality cliff sits

Quantization loss is not linear. Going from 16-bit to 8-bit is essentially free for most models; 5-bit and 4-bit quants typically cost a small, often imperceptible amount of quality; below roughly 3 bits per weight, degradation accelerates sharply and models become noticeably less coherent, weaker at math and code, and more prone to repetition. Small models suffer proportionally more than large ones at the same bit depth because they have less redundancy to absorb the error. The pragmatic rule for a fixed memory budget: pick the largest model whose 4-bit or 5-bit quant fits, rather than a smaller model at higher precision — and test on your own tasks, because sensitivity varies by workload.
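The rule of thumb for a fixed memory budget can be expressed as a small selection helper. This is only a sketch of the heuristic: the candidate sizes, the 4.5 and 5.5 bits-per-weight figures for 4-bit and 5-bit quants, and the memory reserved for the KV cache and runtime are all assumptions, and the function name is hypothetical.

```python
def pick_model(candidates_billion, budget_gb,
               bits_options=(5.5, 4.5), reserve_gb=1.5):
    """Return (params_billion, bits_per_weight, weight_gb) or None.

    Prefers the largest model first, then the higher-precision quant,
    leaving reserve_gb free for the KV cache and runtime overhead.
    """
    usable = budget_gb - reserve_gb
    for params in sorted(candidates_billion, reverse=True):
        for bits in bits_options:
            size_gb = params * bits / 8
            if size_gb <= usable:
                return params, bits, size_gb
    return None


if __name__ == "__main__":
    sizes = [3, 7, 13, 34, 70]  # hypothetical model sizes, in billions
    for vram in (8, 24):
        print(f"{vram} GB budget ->", pick_model(sizes, vram))
```

The helper only answers whether a quant fits; it says nothing about quality, so the chosen model should still be tested on the actual workload.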

Why it matters for sovereign AI

Quantization is what makes running an LLM locally practical: a quantized model can run entirely on a self-owned GPU with no data leaving your premises. For Bitcoiners extending self-custody to their AI tooling, it is the difference between depending on a cloud API and owning the full stack. The same sovereignty logic that argues for running your own node argues for quantizing a strong open-weight model onto hardware you control.

Check which models fit your card on D-Central's GPU and LLM compatibility guide.

Full open-data reference: GGUF Quantization Quality Reference — CSV / JSON + REST API, CC BY 4.0.

