EXL2 (ExLlamaV2 format)

EXL2 is the quantization format used by ExLlamaV2, a fast inference library for running large language models on consumer-class GPUs. It builds on the same second-order optimization method as GPTQ but adds fine-grained mixed precision: a single model can blend 2-, 3-, 4-, 5-, 6-, and 8-bit weights to hit essentially any target average bitrate between 2 and 8 bits per weight. For a sovereign operator with a fixed amount of VRAM, that means tuning the size-versus-quality trade-off to what a particular card can hold, rather than choosing between rigid bit-width tiers that either waste headroom or refuse to fit.

Per-layer and per-column precision

EXL2 does not apply one uniform bit-width across the whole model. During conversion, the quantizer runs calibration data through the model and measures how sensitive each layer, and even individual weight columns within a layer, is to precision loss. It then allocates more bits where errors would propagate and fewer where the model tolerates aggressive compression. The result resembles sparse quantization: the most important weights survive at higher precision inside an otherwise heavily compressed tensor. At the same average bitrate, this calibrated allocation tends to preserve more quality than uniform quantization, which is the format's core advantage. The philosophy will feel familiar to miners: like an autotuner mapping each hashboard's silicon rather than applying one flat setting, EXL2 measures the actual sensitivity of the weights it is compressing and budgets accordingly.
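The budgeting idea above can be sketched with a toy greedy allocator. This is an illustration only, not ExLlamaV2's actual algorithm: the `Layer` class, the `sensitivity` score, and the assumed error model (error shrinking by roughly 4x per extra bit) are all hypothetical stand-ins for what a real calibration pass would measure.

```python
from dataclasses import dataclass

BIT_CHOICES = [2, 3, 4, 5, 6, 8]  # bit-widths named in the text


@dataclass
class Layer:
    name: str
    n_weights: int
    sensitivity: float  # hypothetical score: higher means more fragile


def allocate_bits(layers, target_bpw):
    """Toy greedy allocator (an illustration, not ExLlamaV2's algorithm).

    Every layer starts at the lowest width. Each step upgrades the layer
    with the best assumed error reduction per extra bit spent, as long as
    the total stays within the target average bitrate.
    """
    total = sum(l.n_weights for l in layers)
    budget = target_bpw * total           # total bits we may spend
    bits = {l.name: BIT_CHOICES[0] for l in layers}
    used = BIT_CHOICES[0] * total
    while True:
        best = None  # (gain_per_bit, layer, next_width, extra_bits)
        for l in layers:
            cur = bits[l.name]
            idx = BIT_CHOICES.index(cur)
            if idx + 1 == len(BIT_CHOICES):
                continue                  # already at the widest choice
            nxt = BIT_CHOICES[idx + 1]
            extra = (nxt - cur) * l.n_weights
            if used + extra > budget:
                continue                  # upgrade would exceed the budget
            # Assumed error model: per-weight error ~ sensitivity * 4^(-bits).
            gain = l.sensitivity * (4.0 ** -cur - 4.0 ** -nxt) / (nxt - cur)
            if best is None or gain > best[0]:
                best = (gain, l, nxt, extra)
        if best is None:
            break
        _, layer, nxt, extra = best
        bits[layer.name] = nxt
        used += extra
    return bits, used / total


layers = [Layer("attn", 1000, 10.0), Layer("mlp", 3000, 1.0)]
bits, avg = allocate_bits(layers, target_bpw=3.0)
print(bits, avg)
```

With these made-up inputs, the small, fragile layer ends up at 6 bits while the large, tolerant one stays at 2, and the average lands on the 3.0 target. A real quantizer works at finer granularity and with measured error, but the shape of the trade is the same.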

What it enables

The format is designed specifically for GPU inference. Documented examples include running a 70-billion-parameter model on a single 24 GB card at around 2.55 bits per weight with a 2048-token context, and fitting 13B-class models into 8 GB of VRAM at roughly 2.65 bits. Those numbers put genuinely capable local models within reach of hardware many enthusiasts already own: a used gaming card in a home server, not a rented cloud instance. That is the sovereignty case in miniature: weights on your disk, prompts that never leave your LAN, and no API bill or terms-of-service between you and the model.
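The arithmetic behind those figures is simple enough to check by hand. The sketch below assumes nominal parameter counts (exactly 70e9 and 13e9) and counts only the quantized weights; real checkpoints differ slightly and the runtime needs extra memory on top.

```python
def weight_bytes(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bpw / 8  # bits per weight -> bytes


GB = 1e9        # decimal gigabyte
GIB = 2 ** 30   # binary gibibyte

for name, n_params, bpw in [("70B", 70e9, 2.55), ("13B", 13e9, 2.65)]:
    size = weight_bytes(n_params, bpw)
    print(f"{name} @ {bpw} bpw: {size / GB:.1f} GB ({size / GIB:.1f} GiB)")
```

Under these assumptions the 70B weights come to about 22.3 GB and the 13B weights to about 4.3 GB, which shows how little room a 24 GB card has left for the cache and why the 70B example is a tight fit.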

Where EXL2 fits among formats

Working with EXL2 in practice is mostly a matter of matching bitrate to memory. Community-converted checkpoints are published at a spread of average bitrates — commonly from around 2.4 up through 6 and 8 bits per weight — and the operator picks the largest that fits after accounting for the KV cache, which grows with the context window and can rival the weights themselves in long-context use. Quality falls gently as bitrate drops until a knee somewhere below 3 bits, where degradation becomes obvious; a larger model at low bitrate often still beats a smaller model at high bitrate, which is the counterintuitive rule of thumb worth testing on your own workload. Conversion from raw weights requires a calibration pass, but for most users the practical path is downloading pre-quantized builds. The payoff for the fuss is real: on a single consumer card, a well-chosen EXL2 build usually delivers the best interactive tokens-per-second available for models that fit fully in VRAM.
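The "largest build that fits" rule can be written as a short calculation. This is a rough sketch under stated assumptions: an FP16 cache, a hypothetical 13B-class shape (40 layers, 40 KV heads, head dimension 128, no grouped-query attention), a made-up list of published bitrates, and a flat 1 GB allowance for activations and runtime overhead. None of these values come from a specific checkpoint.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Size of the K and V caches for one sequence (FP16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def pick_bitrate(available_bpw, n_params, vram_bytes, kv_bytes,
                 overhead_bytes=1.0e9):
    """Return the largest bitrate whose weights fit beside the cache.

    overhead_bytes is an assumed allowance, not a measured value.
    Returns None when no listed build fits.
    """
    budget = vram_bytes - kv_bytes - overhead_bytes
    fitting = [b for b in available_bpw if n_params * b / 8 <= budget]
    return max(fitting) if fitting else None


builds = [2.4, 3.0, 4.0, 5.0, 6.0, 8.0]   # hypothetical published bitrates
vram = 12 * 2 ** 30                        # a 12 GiB card

for ctx in (4096, 8192, 16384):
    kv = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128, ctx_len=ctx)
    choice = pick_bitrate(builds, 13e9, vram, kv)
    print(f"ctx={ctx}: cache {kv / 2 ** 30:.2f} GiB -> pick {choice}")
```

In this example the affordable bitrate drops from 5.0 to 3.0 as the context doubles from 4096 to 8192 tokens, and at 16384 tokens the cache alone exceeds the card. Models with grouped-query attention have far fewer KV heads and are much less affected.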

EXL2 is a GPU-first format: it assumes the whole model lives in VRAM and rewards that assumption with high generation speed through the ExLlamaV2 runtime, also drivable from front ends such as text-generation-webui. The block-wise GGUF family takes the opposite bet, supporting CPU execution and CPU/GPU splitting for machines without enough VRAM, while calibrated 4-bit formats like GPTQ and AWQ serve batching-oriented server stacks. Choosing among them is mostly a question of hardware: if the model fits entirely on your GPU, an EXL2 build at a tuned bitrate is often the fastest single-user path; if it does not, GGUF's offloading wins. For the concepts underneath, see LLM quantization and k-quants.

Pick a bitrate for your VRAM in the GPU–LLM fit dataset.
