Definition
EXL2 is the quantization format used by ExLlamaV2, a fast inference library for running large language models on consumer-class GPUs. It is based on the same optimization method as GPTQ but adds fine-grained mixed precision: a single model can blend 2-, 3-, 4-, 5-, 6-, and 8-bit weights to reach essentially any target average bitrate between 2 and 8 bits per weight. For a sovereign operator with a fixed amount of VRAM, this means dialing in the exact size-versus-quality tradeoff the card can hold rather than being stuck with rigid bit-width tiers.
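To make the idea of an average bitrate concrete, here is a minimal arithmetic sketch. It assumes the average is simply the weight-fraction-weighted mean of the bit-widths and ignores the small overhead of scales and other metadata. The helper names `average_bpw` and `two_width_blend` are hypothetical and are not part of ExLlamaV2.

```python
def average_bpw(mix):
    """Average bits per weight for a mix of {bit_width: fraction_of_weights}."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(bits * frac for bits, frac in mix.items())


def two_width_blend(target, low, high):
    """Fractions of weights at `low` and `high` bits that average to `target`."""
    if not (low < high and low <= target <= high):
        raise ValueError("need low < high and low <= target <= high")
    frac_high = (target - low) / (high - low)
    return {low: 1.0 - frac_high, high: frac_high}


# A blend of 2-bit and 3-bit weights that averages 2.55 bits per weight:
# 55% of the weights end up at 3 bits, the rest at 2 bits.
blend = two_width_blend(2.55, 2, 3)
three_way = average_bpw({2: 0.5, 4: 0.25, 8: 0.25})
```

A real quantizer chooses the mix from measured error rather than from a formula like this; the sketch only shows why fractional targets such as 2.55 are reachable from integer bit-widths.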
Per-layer and per-column precision
EXL2 does not apply one uniform bit-width across the whole model. It allocates more bits to layers, and even to individual weight columns within a layer, that are sensitive to precision loss, and fewer bits where the model tolerates aggressive compression. The effect resembles sparse quantization: the most important weights are stored at higher precision inside an otherwise heavily compressed tensor. This per-layer allocation tends to give better quality than uniform quantization at the same average bitrate.
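The allocation idea can be illustrated with a toy greedy search. This is a sketch under stated assumptions, not ExLlamaV2's actual procedure: it works per layer only (no per-column mixing), and the `Layer` class, the `allocate_bits` function, and the error numbers are all hypothetical. It starts every layer at 2 bits, then repeatedly upgrades whichever layer gains the most error reduction per extra bit spent, stopping when the bit budget implied by the target average is used up.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BIT_OPTIONS = [2, 3, 4, 5, 6, 8]  # bit-widths the source lists for EXL2


@dataclass
class Layer:
    name: str
    n_weights: int
    error: Dict[int, float]  # hypothetical measured error per bit-width


def allocate_bits(layers: List[Layer], target_bpw: float) -> Tuple[Dict[str, int], float]:
    """Greedy per-layer bit allocation under an average-bitrate budget."""
    total_weights = sum(layer.n_weights for layer in layers)
    budget_bits = target_bpw * total_weights
    choice = {layer.name: BIT_OPTIONS[0] for layer in layers}
    used_bits = BIT_OPTIONS[0] * total_weights

    while True:
        best = None  # (gain_per_bit, layer, next_bits, extra_bits)
        for layer in layers:
            cur = choice[layer.name]
            idx = BIT_OPTIONS.index(cur)
            if idx + 1 == len(BIT_OPTIONS):
                continue  # already at the highest precision
            nxt = BIT_OPTIONS[idx + 1]
            extra = (nxt - cur) * layer.n_weights
            if used_bits + extra > budget_bits:
                continue  # upgrade would exceed the budget
            gain = (layer.error[cur] - layer.error[nxt]) / extra
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, layer, nxt, extra)
        if best is None:
            break  # no affordable upgrade improves anything
        _, layer, nxt, extra = best
        choice[layer.name] = nxt
        used_bits += extra

    return choice, used_bits / total_weights


# Made-up error curves: one layer degrades sharply at low precision, one barely changes.
layers = [
    Layer("sensitive", 1000, {2: 1.0, 3: 0.5, 4: 0.2, 5: 0.1, 6: 0.05, 8: 0.01}),
    Layer("tolerant", 1000, {2: 0.1, 3: 0.08, 4: 0.07, 5: 0.065, 6: 0.062, 8: 0.06}),
]
choice, achieved_bpw = allocate_bits(layers, target_bpw=3.0)
```

With these made-up numbers the sensitive layer ends up at 4 bits and the tolerant one stays at 2, for the same 3.0-bit average that a uniform 3-bit scheme would spend.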
What it enables
The format is designed for GPU inference. Documented examples include running a 70-billion-parameter model on a single 24 GB card at around 2.55 bits per weight, and fitting 13B models into 8 GB of VRAM at roughly 2.65 bits. That makes capable local models reachable on hardware many enthusiasts already own.
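The figures above can be sanity-checked with back-of-the-envelope arithmetic: weight storage is roughly parameters times bits per weight, divided by 8 to get bytes. The sketch below assumes decimal gigabytes, counts weights only, and leaves the KV cache, activations, and framework overhead to a separate reserve. The function names and the 1.5 GB reserve are illustrative assumptions, not measured values.

```python
GB = 1e9  # decimal gigabytes (assumption; capacities are sometimes quoted in GiB)


def weight_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights alone, in GB."""
    return n_params * bits_per_weight / 8 / GB


def max_bpw(n_params, vram_gb, reserve_gb=0.0):
    """Highest average bitrate whose weights fit after reserving some VRAM."""
    usable_bits = (vram_gb - reserve_gb) * GB * 8
    return usable_bits / n_params


size_70b = weight_gb(70e9, 2.55)   # about 22.3 GB of weights
size_13b = weight_gb(13e9, 2.65)   # about 4.3 GB of weights

# Hypothetical 1.5 GB reserve for cache and activations on a 24 GB card.
ceiling_70b = max_bpw(70e9, 24, reserve_gb=1.5)  # about 2.57 bits per weight
```

The 70B case leaves under 2 GB of a 24 GB card for everything other than weights, so context length and cache settings decide whether it fits in practice. The 13B case leaves considerably more room on an 8 GB card.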
EXL2 is one of several local formats; for the block-wise CPU/GPU alternative see GGUF, and for background see LLM quantization.
Pick a bitrate for your VRAM in the GPU–LLM fit dataset.
