Local LLM VRAM Calculator: How Much GPU Memory Does Your AI Model Need?

How much VRAM does a local LLM need? Multiply the parameter count (in billions) by the bytes per weight for your quantization (FP16 = 2.0 bytes, Q8 ≈ 1.06, Q5 ≈ 0.71, Q4 ≈ 0.56), then add the KV cache for your context window and a 10 % framework overhead on top of both. A 7 B model at Q4_K_M fits in 6 GB; at FP16 its weights alone take 14 GB, so it needs about 16 GB in total at 4 K context. Use the calculator below for your exact configuration.

Running an open-weight LLM locally gives you complete data sovereignty: no API calls leave your network, no third party sees your prompts, and inference speed is bounded only by your hardware, not a queue. The critical planning variable is always the same — does your GPU (or Apple Silicon unified memory) have enough VRAM to hold the model weights plus the KV cache for your target context length?

This calculator implements the formula used by llama.cpp and ExLlamaV2 for VRAM estimation: weight footprint + KV cache at your context length + 10 % framework overhead. All estimates are approximate — MoE sparsity, rope scaling, flash attention, and batch size all shift actual usage; the calculator hedges accordingly. Verify with nvidia-smi on your actual setup.

D-Central Technologies stands on the shoulders of the open-source community — llama.cpp (Georgi Gerganov et al.), ExLlamaV2 (turboderp), Ollama, vLLM, and the model teams at Meta, Mistral AI, Alibaba Qwen, and Google DeepMind who publish open weights and architectural details. This calculator cites their model cards and documentation throughout.

VRAM calculator

Inputs:

Model size: billions of parameters (0.1 – 10 000)
Quantization
Context length (K tokens)
Attention architecture: affects KV-cache size; GQA uses 4–8× fewer KV heads
Active inference hours/day: hours the GPU is actively generating tokens (the rest counts as idle)
Canadian province / rate

How the VRAM formula works

The total VRAM required to run an LLM locally has three components. Each can be computed independently, then summed.

1. Weight footprint

Every parameter in the model is stored as a floating-point or quantized number in GPU memory. The weight footprint is simply:

VRAM_weights (GB) = params_B × bytes_per_weight

Where bytes_per_weight (GGUF format, llama.cpp):
  FP16    = 2.000 bytes  (16 bits, no quantization)
  Q8_0    ≈ 1.063 bytes  (8 bits + one f16 block scale per 32 weights = 8.5 bits)
  Q5_K_M  ≈ 0.709 bytes  (mixed 5-bit + 6-bit per block ≈ 5.67 bits)
  Q4_K_M  ≈ 0.563 bytes  (mixed 4-bit + 6-bit per block ≈ 4.5 bits)

Source: llama.cpp GGUF documentation; quantization format details from ggml-quants.c.

Practical examples: A 7 B model at FP16 occupies 7 × 2.0 = 14 GB; at Q4_K_M it occupies 7 × 0.563 ≈ 3.9 GB. A 70 B model at Q4_K_M requires 70 × 0.563 ≈ 39.4 GB, which fits a 48 GB professional GPU but leaves little room once KV cache and overhead are added.

Quantization	Bits / weight	Bytes / param	7 B model	70 B model	Quality vs FP16
FP16	16	2.000	14.0 GB	140 GB	Reference (100 %)
Q8_0	8.5	1.063	7.4 GB	74 GB	Near-lossless (>99 %)
Q5_K_M	~5.7	0.709	5.0 GB	50 GB	High (>97 % on >7B)
Q4_K_M	~4.5	0.563	3.9 GB	39 GB	Good (>95 % on >13B; varies on small models)

Quality estimates are approximate, based on community perplexity benchmarks (llama.cpp wiki, TheBloke quantization comparisons). Actual quality depends on the model, task, and temperature settings.
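The weight-footprint formula above can be sketched in a few lines of Python. This is an illustrative helper, not the calculator's actual source; the function name weight_footprint_gb is hypothetical, and the bits-per-weight values are the ones listed in this section.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB: billions of parameters times bytes per weight."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight


# Bits per weight as listed in this section (FP16 and Q4_K_M shown).
for name, bits in [("FP16", 16.0), ("Q4_K_M", 4.5)]:
    print(f"7 B at {name}: {weight_footprint_gb(7, bits):.1f} GB")
```

Passing bits rather than bytes keeps the helper usable for formats not listed here: divide the format's effective bits per weight by 8 and the same multiplication applies.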

2. KV cache

The KV (key-value) cache holds the attention state for every token in the current context window. Without it, the model would re-process the entire prompt at every generation step. The cache is normally stored in FP16 (2 bytes per element) regardless of weight quantization, and it scales linearly with context length.

VRAM_kv (GB) = 2 (K + V)
             × n_layers
             × n_kv_heads      ← GQA reduces this vs MHA
             × head_dim
             × context_tokens
             × 2 bytes (FP16)
             ÷ 2³⁰             (convert bytes → GB)

Since the architecture parameters (n_layers, n_kv_heads, head_dim) vary by model family, the calculator uses verified reference values from published model cards and interpolates log-linearly for intermediate sizes. See the KV cache reference section below.
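As a rough sketch, the KV-cache formula above translates directly into code. The function name kv_cache_gb is hypothetical. Two conventions are assumptions chosen to match the reference table in this article: "1K tokens" is taken as 1,000 tokens, and dividing by 2³⁰ technically yields GiB, which the article labels GB.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_element: int = 2) -> float:
    """KV cache size: K and V tensors for every layer, KV head and token."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128 (values from this article).
per_1k = kv_cache_gb(32, 8, 128, 1_000)
print(f"{per_1k:.3f} GB per 1K tokens, {per_1k * 32:.1f} GB at 32K context")
```

Setting bytes_per_element to 1 models an 8-bit KV cache, which halves the result.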

3. Framework overhead

llama.cpp, ExLlamaV2, and vLLM allocate additional memory for attention masks, activation buffers, temporary tensors, and the compute graph. This typically adds 8–15 % on top of weights + KV cache. The calculator uses a conservative 10 % flat overhead, consistent with llama.cpp’s --verbose output on typical workloads.

The complete formula

VRAM_total (GB) = VRAM_weights
                + VRAM_kv_cache
                + VRAM_overhead (10 % of above sum)

Example — 7 B model, Q4_K_M, 4 K context (GQA):
  Weights   = 7 × 0.563                            = 3.94 GB
  KV cache  = 4 K tokens × 0.122 GB/1K tokens     = 0.49 GB
  Overhead  = 10 % × (3.94 + 0.49)                = 0.44 GB
  Total     ≈ 4.87 GB                              → fits 6 GB tier
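The worked example above can be reproduced with a minimal sketch. The function name total_vram_gb is hypothetical, and the flat 10 % overhead is the calculator's stated assumption rather than a measured value.

```python
def total_vram_gb(weights_gb: float, kv_gb: float, overhead: float = 0.10) -> float:
    """Weights plus KV cache, with a flat overhead applied to their sum."""
    return (weights_gb + kv_gb) * (1 + overhead)


weights = 7 * 0.563   # 7 B parameters at Q4_K_M
kv = 4 * 0.122        # 4 K context at 0.122 GB per 1K tokens (GQA)
total = total_vram_gb(weights, kv)
print(f"weights {weights:.2f} GB + KV {kv:.2f} GB -> total {total:.2f} GB")
```

Raising overhead to 0.15 gives the upper end of the 8–15 % range quoted above, which is a reasonable safety margin when a configuration lands close to a tier boundary.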

KV cache: the hidden VRAM cost

At short context lengths (2–4 K tokens), KV cache is a minor factor — typically 5–15 % of total VRAM. But at 32 K or 128 K context, it can exceed the weight footprint for large models, fundamentally changing which hardware tier you need.

The cache size is architecture-dependent, not just parameter-count-dependent. The critical variables are:

n_layers — more transformer blocks = more KV state per token
n_kv_heads — Grouped Query Attention (GQA) reduces this vs Multi-Head Attention (MHA); Llama 3.1 8B uses 8 KV heads vs 32 query heads (4× reduction)
head_dim — typically 64 or 128 in modern models

Model	Layers	KV heads	Head dim	GB / 1K tokens (FP16)	KV at 32K ctx	Source
Llama 3.2 1B	16	8 (GQA)	64	0.031 GB	0.99 GB	Meta model card
Llama 3.2 3B	28	8 (GQA)	128	0.107 GB	3.4 GB	Meta model card
Llama 3.1 8B	32	8 (GQA)	128	0.122 GB	3.9 GB	Meta model card
Qwen2.5 14B	48	8 (GQA)	128	0.183 GB	5.9 GB	Alibaba model card
Llama 3.1 70B	80	8 (GQA)	128	0.305 GB	9.8 GB	Meta model card
Llama 3.1 405B	126	8 (GQA)	128	0.481 GB	15.4 GB	Meta model card
Llama 2 7B (MHA)	32	32 (MHA)	128	0.488 GB	15.6 GB	Meta Llama 2 paper
Llama 2 13B (MHA)	40	40 (MHA)	128	0.763 GB	24.4 GB	Meta Llama 2 paper

MHA (the last two rows) requires dramatically more KV cache. Always select the correct architecture type in the calculator above. KV cache figures assume FP16 storage; some frameworks support an FP8 KV cache, which reduces KV VRAM by ~50 % with minor quality impact (not yet default in llama.cpp as of 2026-06).

Key insight: At 32 K context, a Llama 2 7B model (MHA) needs ~15.6 GB of KV cache alone, roughly four times its Q4_K_M weight footprint, pushing total VRAM to about 21.5 GB and onto a 24 GB card. The same context on Llama 3.1 8B (GQA) needs only 3.9 GB of KV cache, a 4× reduction that brings the Q4_K_M total to about 9.2 GB and lets the model run on a 12 GB GPU. This architectural change is why GQA adoption became near-universal after 2023.

GPU VRAM tiers and which models fit

The calculator recommends the minimum tier that holds your total VRAM requirement. More VRAM headroom means longer context, larger batches, and faster inference (fewer or no CPU layer offloads).

Tier	Example GPU(s)	TDP (mfr spec)	Fits comfortably	Tight / offload needed
6–8 GB	RTX 4060 8 GB	115 W	≤ 7B Q4, 3B Q8	7B Q4 at >4K context; 8B Q5
12–16 GB	RTX 4070 12 GB · RTX 4060 Ti 16 GB	200 W · 165 W	8B Q8, 13B Q4	7B FP16 (16 GB card only); 13B Q5; 14B Q4 at long ctx
24 GB	RTX 4090 24 GB · RTX 3090	450 W · 350 W	27B Q4, 13B Q8, 8B FP16	34B Q4 at short ctx
48 GB	RTX 6000 Ada · A40	300 W · 300 W	70B Q4 (tight), 30B Q8	70B Q8; Llama 4 Scout INT4
80 GB	H100 PCIe · A100 80 GB	350 W · 400 W	70B Q8 (tight)	70B FP16 and 405B Q4 (both need multi-GPU)
160 GB (2×80)	2× H100 PCIe · 2× A100	700 W (pair)	180B Q4, 70B FP16	405B Q4 (needs 4× GPU)
Cluster	4–8× H100 SXM · DGX H100 node	>3 kW per node	405B Q8, DeepSeek V4 Pro	Contact D-Central for cluster design

TDP figures from NVIDIA official product pages. Inference typically draws 60–80 % of TDP; peak (FP16 training) hits TDP. “Fits comfortably” assumes 4 K context; longer context shifts boundaries. See Local AI Hardware Guide for model-to-hardware mapping.
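A minimal sketch of the "minimum tier that fits" rule follows. The list of card sizes is an assumption that flattens the tier ranges in the table into discrete capacities, and the function name smallest_fitting_tier is hypothetical.

```python
from bisect import bisect_left

# Discrete VRAM capacities (GB) taken from the tier table; an assumption
# for illustration, not the calculator's internal list.
CARD_SIZES_GB = [6, 8, 12, 16, 24, 48, 80, 160]


def smallest_fitting_tier(total_vram_gb: float):
    """Return the smallest capacity >= the requirement, or None for a cluster."""
    i = bisect_left(CARD_SIZES_GB, total_vram_gb)
    return CARD_SIZES_GB[i] if i < len(CARD_SIZES_GB) else None


print(smallest_fitting_tier(4.87))   # 7 B Q4_K_M at 4 K context
print(smallest_fitting_tier(200.0))  # beyond 2x80 GB, so no single tier fits
```

A requirement that lands within a gigabyte or so of a tier's capacity is better treated as belonging to the next tier up, since the overhead estimate is approximate.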

Apple Silicon note: M1/M2/M3/M4 Macs use unified memory shared between CPU and GPU. A Mac Studio M2 Ultra with 192 GB can run 70B models at Q8 comfortably; FP16 (~140 GB of weights) fits but leaves little headroom. Its memory bandwidth (~800 GB/s) is below the H100 PCIe’s ~2,000 GB/s but in the same range as an RTX 4090 (~1,000 GB/s), which is what matters for memory-bound LLM inference. llama.cpp and Ollama both support the Metal backend natively.

Running costs: electricity by Canadian province

Local inference runs 24/7 against your electricity bill. Québec’s hydro rates make it one of the cheapest places on Earth to run a GPU inference hashcenter; Nunavut’s diesel-generation rates make the same workload about 9.6× more expensive ($0.680 vs $0.071 per kWh). The calculator above uses the approximate values in the table below; for precise current tariffs see /canada-electricity-rates-by-province/.

Province / Territory	Approx. rate (CAD/kWh)	Monthly cost, RTX 4090 8h/day	Notes
Québec	$0.071	~$7 / mo	Hydro-Québec residential; Block 1
Manitoba	$0.098	~$10 / mo	Manitoba Hydro residential
British Columbia	$0.130	~$13 / mo	BC Hydro Step 2 rate
New Brunswick	$0.140	~$14 / mo	NB Power residential
Newfoundland & Labrador	$0.147	~$15 / mo	NL Hydro residential
Yukon	$0.161	~$16 / mo	Yukon Energy residential
Ontario	$0.165	~$17 / mo	TOU blended average
Alberta	$0.175	~$18 / mo	Regulated Rate Option avg; market rate varies
Saskatchewan	$0.185	~$19 / mo	SaskPower residential
Nova Scotia	$0.218	~$22 / mo	Nova Scotia Power residential
PEI	$0.235	~$24 / mo	Maritime Electric residential
Northwest Territories	$0.370	~$37 / mo	NT Power residential
Nunavut	$0.680	~$69 / mo	Qulliq Energy, diesel generation

Monthly cost estimates assume RTX 4090 (350 W load, 35 W idle) running active inference 8 h/day, idle the remaining 16 h/day. All rates are approximate as of 2026-06; rates change seasonally and by tariff block. Verify at your utility’s website. Full breakdown: /canada-electricity-rates-by-province/.
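The cost model described in the note above (350 W under load, 35 W idle, 8 active hours per day) can be sketched as follows. The function name monthly_cost_cad and the 30-day month are assumptions made for illustration.

```python
def monthly_cost_cad(rate_per_kwh: float, load_w: float = 350.0,
                     idle_w: float = 35.0, active_hours: float = 8.0,
                     days: int = 30) -> float:
    """Monthly electricity cost for a GPU that is active part of each day."""
    idle_hours = 24.0 - active_hours
    kwh_per_day = (load_w * active_hours + idle_w * idle_hours) / 1000.0
    return kwh_per_day * days * rate_per_kwh


quebec = monthly_cost_cad(0.071)
nunavut = monthly_cost_cad(0.680)
print(f"Québec ${quebec:.2f}/mo, Nunavut ${nunavut:.2f}/mo, "
      f"ratio {nunavut / quebec:.1f}x")
```

Setting active_hours to 24 models a GPU that runs at full load around the clock, which is the relevant case for a shared inference server rather than a personal workstation.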

Running local AI in Québec costs about 9.6× less than in Nunavut ($0.071 vs $0.680 per kWh). If you are designing a sovereign AI inference hashcenter rather than a personal workstation, province-level electricity cost can dwarf hardware amortization over a 3-year horizon. See /energy-for-compute/ for a deeper analysis of Canadian AI compute economics and /quebec-hydro-ai-compute/ for Québec-specific feasibility data.

Frequently asked questions

How does quantization affect model output quality?

Quality degradation scales inversely with model size. At Q4_K_M, small models (< 7 B parameters) can show measurable perplexity degradation on reasoning and instruction-following benchmarks; the same quantization on a 70 B model typically retains over 97 % of FP16 quality. Q5_K_M and Q8_0 are near-lossless for all model sizes above 3 B. The practical rule: if you are fitting a small model into a small GPU, consider whether a quantized large model on better hardware would serve you better. Community perplexity comparisons are published in the llama.cpp quantization discussion and TheBloke’s model cards on Hugging Face.

What is KV cache and why does context length multiply my VRAM requirements?

The KV (key-value) cache stores the attention keys and values, the “memory” of every token the model has already processed in the current session. Without it, the model would have to recompute attention over the entire prompt for every new token generated, making inference impractically slow. The cache is stored in GPU memory and grows linearly with context length: doubling your context window doubles the KV cache. At 4 K context it is a minor factor (5–15 % of total VRAM for GQA models); at 128 K context it can exceed the weight footprint for large models. The KV cache is normally stored in FP16 regardless of weight quantization.

Can I run a model that is larger than my VRAM using CPU offloading?

Yes. llama.cpp supports --n-gpu-layers N, which places N transformer layers in GPU VRAM and leaves the rest in system RAM, where the CPU processes them. ExLlamaV2 has equivalent functionality. A model that needs 24 GB can run on a 16 GB GPU + 8 GB of fast DDR5 RAM. The trade-off is speed: system RAM bandwidth (50–80 GB/s for DDR5) is roughly 9–67× lower than GPU VRAM bandwidth (700–3,350 GB/s depending on GPU generation). Expect token generation to slow by roughly 3–10×, with the penalty growing as more layers are left on the CPU. For interactive chat this is tolerable; for batch inference it is usually not worth the latency.
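One way to pick a starting value for --n-gpu-layers is sketched below. It is a rough estimate under stated assumptions: all layers are treated as equal in size, embeddings are ignored, and reserve_gb is a hypothetical allowance for KV cache and scratch buffers. The function name gpu_layer_count is illustrative.

```python
import math


def gpu_layer_count(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# A 24 GB model with 40 layers (hypothetical) on a 16 GB GPU.
print(gpu_layer_count(24.0, 40, 16.0))
```

Treat the result as a first guess: start there, watch actual VRAM usage, and lower the value if the process runs out of memory at your target context length.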

My model card says “70 B MoE with 8 B active parameters” — do I need 70 B or 8 B of VRAM?

You need the full 70 B VRAM. Mixture-of-Experts (MoE) models activate only a subset of their expert feed-forward blocks per token (8 B in this example), but all expert weights must be resident in memory because the router selects different experts for each token dynamically. Llama 4 Scout, for instance, has 109 B total parameters with 17 B active per token — it still requires approximately 55 GB at INT4. The “active parameters” figure describes compute throughput and power draw, not memory footprint. See the Local AI Hardware Guide for detailed MoE VRAM requirements.

What is GQA and why does it dramatically change KV cache requirements?

Grouped Query Attention (GQA), introduced in Ainslie et al. (2023), reduces the number of key-value attention heads while keeping the full number of query heads. Llama 3.1 8B has 32 query heads but only 8 KV heads — a 4× reduction in KV cache size compared to standard Multi-Head Attention (MHA). Llama 2 7B (MHA) needs 0.488 GB of KV cache per 1 K context tokens; Llama 3.1 8B (GQA) needs only 0.122 GB — a 4× improvement despite being a similar-size model. GQA became the standard architecture for virtually all open-weight models released after mid-2023. When using the calculator, selecting “MHA” for a Llama 3 model will significantly over-estimate KV cache.

How do I measure actual VRAM usage when running a model locally?

The most direct method on NVIDIA GPUs is watch -n1 nvidia-smi in a terminal; it refreshes usage stats every second, showing per-process VRAM allocation. When you start llama.cpp with --verbose, it prints a detailed breakdown of VRAM allocation (model weights, KV cache, scratch buffers) at startup. On Apple Silicon, use Activity Monitor (Window → GPU History, plus the Memory tab for unified-memory pressure). For AMD ROCm GPUs: rocm-smi or radeontop. For Ollama, ollama ps lists each loaded model with its size and CPU/GPU split. Actual usage will typically be within 5–15 % of this calculator’s estimates; the 10 % overhead factor is deliberately conservative.
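For scripted checks, nvidia-smi can emit machine-readable output through its --query-gpu and --format options. The sketch below wraps that call; the function names are hypothetical, and it assumes nvidia-smi is on the PATH and reports memory in MiB when nounits is set.

```python
import subprocess


def parse_memory_csv(text: str):
    """Parse 'used, total' pairs (MiB), one line per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return a list of (used_mib, total_mib) per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling query_gpu_memory() once before loading a model and once after gives the model's real footprint as the difference in used memory, which can then be compared against the calculator's estimate.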

Related tools and guides on D-Central

Local AI Hardware Guide — model-to-hardware mapping table: which GPU tier or Apple Silicon config fits every major open-weight model from 1 B to 405 B
Running Local LLMs in Canada — privacy law, practical setup, and Canadian hardware sourcing guide
Canada Electricity Rates by Province — full provincial rate tables with utility sources and commercial tier data
Energy for Compute — economics of AI inference power: how to evaluate total cost of ownership for a local GPU inference deployment
Distributed Compute — multi-node inference and GPU cluster design for models that exceed single-machine VRAM capacity
Sovereign AI in Canada — strategic guide to Canadian digital sovereignty in the AI era
AI Sovereignty Consulting — D-Central advises Canadian organizations on private GPU inference infrastructure design
Mining Profitability Calculator — same electricity inputs, for Bitcoin mining ASIC workloads
Power Cost Calculator — watt-hour to dollar converter with Canadian provincial rates

Will it actually run — and how fast?

Fit is only half the answer. Single-user decode speed is memory-bound: every generated token reads the full weight file once, so tokens/sec ≈ memory bandwidth ÷ weight size. The list below pulls the live GPU database and shows the smallest cards that hold your configuration fully in VRAM, with a realistic 50–80% efficiency band (dense model, batch 1). Mixture-of-experts models read fewer weights per token and run faster than these numbers.
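The bandwidth rule of thumb above can be sketched as follows. The function name decode_speed_band is hypothetical, and the 1,008 GB/s figure for an RTX 4090 is used only as an example input.

```python
def decode_speed_band(bandwidth_gb_s: float, weights_gb: float,
                      efficiency=(0.5, 0.8)):
    """Estimate (low, high) tokens/sec for a dense model at batch size 1."""
    ideal = bandwidth_gb_s / weights_gb  # each token reads all weights once
    low, high = efficiency
    return ideal * low, ideal * high


# 7 B model at Q4_K_M (3.94 GB of weights) on a ~1,008 GB/s GPU.
low, high = decode_speed_band(1008.0, 3.94)
print(f"roughly {low:.0f} to {high:.0f} tokens/sec")
```

The estimate applies only when the whole model sits in VRAM; any layers offloaded to system RAM are read at system-RAM bandwidth and will dominate the time per token.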

Scripts & AI assistants: GET /wp-json/dc/v1/vram-fit?params_b=8&quant=q4&context_k=8 returns this computation as JSON (add bpw= for AWQ/GPTQ/FP8 footprints).


Last reviewed June 18, 2026.