Local LLM VRAM Calculator: How Much GPU Memory Does Your AI Model Need?
Running an open-weight LLM locally gives you complete data sovereignty: no API calls leave your network, no third party sees your prompts, and inference speed is bounded only by your hardware, not a queue. The critical planning variable is always the same — does your GPU (or Apple Silicon unified memory) have enough VRAM to hold the model weights plus the KV cache for your target context length?
This calculator follows the same accounting that llama.cpp and ExLlamaV2 users apply when estimating VRAM: weight footprint + KV cache at your context length + 10 % framework overhead. All estimates are approximate: MoE sparsity, RoPE scaling, flash attention, and batch size all shift actual usage, so treat the result as a planning figure and verify with nvidia-smi on your actual setup.
D-Central Technologies stands on the shoulders of the open-source community — llama.cpp (Georgi Gerganov et al.), ExLlamaV2 (turboderp), Ollama, vLLM, and the model teams at Meta, Mistral AI, Alibaba Qwen, and Google DeepMind who publish open weights and architectural details. This calculator cites their model cards and documentation throughout.
VRAM calculator
How the VRAM formula works
The total VRAM required to run an LLM locally has three components. Each can be computed independently, then summed.
1. Weight footprint
Every parameter in the model is stored as a floating-point or quantized number in GPU memory. The weight footprint is simply:
VRAM_weights (GB) = params_B × bytes_per_weight
Where bytes_per_weight (GGUF format, llama.cpp):
FP16 = 2.000 bytes (16 bits, no quantization)
Q8_0 ≈ 1.016 bytes (calculator value, ≈ 8.125 bits; the Q8_0 block layout of 32 int8 weights plus one FP16 scale is 8.5 bits, about 1.06 bytes, so real files run ~5 % larger)
Q5_K_M ≈ 0.709 bytes (mixed 5-bit + 6-bit per block ≈ 5.67 bits)
Q4_K_M ≈ 0.563 bytes (mixed 4-bit + 6-bit per block ≈ 4.5 bits)
Source: llama.cpp GGUF documentation; quantization format details from ggml-quants.c.
Practical examples: A 7 B model at FP16 occupies 7 × 2.0 = 14 GB; at Q4_K_M it occupies 7 × 0.563 ≈ 3.9 GB. A 70 B model at Q4_K_M requires 70 × 0.563 ≈ 39.4 GB for weights alone, which fits a 48 GB professional GPU but becomes tight once KV cache and overhead are added.
| Quantization | Bits / weight | Bytes / param | 7 B model | 70 B model | Quality vs FP16 |
|---|---|---|---|---|---|
| FP16 | 16 | 2.000 | 14.0 GB | 140 GB | Reference (100 %) |
| Q8_0 | ~8.5 | 1.016 | 7.1 GB | 71 GB | Near-lossless (>99 %) |
| Q5_K_M | ~5.7 | 0.709 | 5.0 GB | 50 GB | High (>97 % on >7B) |
| Q4_K_M | ~4.5 | 0.563 | 3.9 GB | 39 GB | Good (>95 % on >13B; varies on small models) |
Quality estimates are approximate, based on community perplexity benchmarks (llama.cpp wiki, TheBloke quantization comparisons). Actual quality depends on the model, task, and temperature settings.
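As a minimal sketch, the weight-footprint arithmetic can be written as a small lookup. The function name `weight_footprint_gb` and the `BYTES_PER_WEIGHT` dictionary are illustrative and not part of llama.cpp; the bytes-per-weight values are the approximate figures quoted in this section.

```python
# Approximate bytes per weight, as quoted in this article (not read from a GGUF file).
BYTES_PER_WEIGHT = {
    "FP16": 2.000,
    "Q8_0": 1.016,
    "Q5_K_M": 0.709,
    "Q4_K_M": 0.563,
}


def weight_footprint_gb(params_billions: float, quant: str) -> float:
    """Weight footprint in GB: parameters (in billions) times bytes per weight."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    if quant not in BYTES_PER_WEIGHT:
        raise ValueError(f"unknown quantization: {quant!r}")
    return params_billions * BYTES_PER_WEIGHT[quant]


# Print the 7 B and 70 B columns for every quantization level.
for quant in BYTES_PER_WEIGHT:
    small = weight_footprint_gb(7, quant)
    large = weight_footprint_gb(70, quant)
    print(f"{quant:7s} 7B: {small:5.1f} GB   70B: {large:6.1f} GB")
```

Because the per-weight values are averages, a real model file can differ by a few percent; the file size on disk is a good cross-check.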
2. KV cache
The KV (key-value) cache holds the attention state for every token in the current context window. Without it, the model would re-process the entire prompt at every generation step. The cache is normally stored in FP16 (2 bytes per element) regardless of weight quantization, and it scales linearly with context length.
VRAM_kv (GB) = 2 (K + V)
× n_layers
× n_kv_heads ← GQA reduces this vs MHA
× head_dim
× context_tokens
× 2 bytes (FP16)
÷ 2³⁰ (convert bytes → GB)
Since the architecture parameters (n_layers, n_kv_heads, head_dim) vary by model family, the calculator uses verified reference values from published model cards and interpolates log-linearly for intermediate sizes. See the KV cache reference section below.
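The formula translates directly into code. This is a sketch under the FP16-cache assumption stated above; `kv_cache_gib` and its parameter names are hypothetical. The example passes 1,000 tokens, which is the token count that reproduces the per-1K figures in the reference table.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens x bytes, over 2**30."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return bytes_per_token * context_tokens / 2**30


# Architecture values taken from the reference table in this article.
llama31_8b = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=1000)
llama2_7b = kv_cache_gib(n_layers=32, n_kv_heads=32, head_dim=128, context_tokens=1000)

print(f"Llama 3.1 8B (GQA): {llama31_8b:.3f} per 1K tokens")
print(f"Llama 2 7B (MHA):   {llama2_7b:.3f} per 1K tokens")
print(f"MHA / GQA ratio:    {llama2_7b / llama31_8b:.1f}x")
```

Setting `bytes_per_element=1` models an 8-bit KV cache, which halves the result.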
3. Framework overhead
llama.cpp, ExLlamaV2, and vLLM allocate additional memory for attention masks, activation buffers, temporary tensors, and the compute graph. This typically adds 8–15 % on top of weights + KV cache. The calculator uses a flat 10 % overhead, a value inside that range and consistent with llama.cpp’s --verbose output on typical workloads.
The complete formula
VRAM_total (GB) = VRAM_weights
+ VRAM_kv_cache
+ VRAM_overhead (10 % of above sum)
Example — 7 B model, Q4_K_M, 4 K context (GQA):
Weights = 7 × 0.563 = 3.94 GB
KV cache = 4 K tokens × 0.122 GB/1K tokens = 0.49 GB
Overhead = 10 % × (3.94 + 0.49) = 0.44 GB
Total ≈ 4.87 GB → fits 6 GB tier
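The worked example can be reproduced with a few lines. This is only a sketch of the three-term sum described above; `total_vram_gb` is a hypothetical helper, and the inputs are the article's own example values.

```python
def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  overhead_fraction: float = 0.10) -> dict:
    """Return each component and the total, with overhead applied to weights + KV cache."""
    overhead_gb = overhead_fraction * (weights_gb + kv_cache_gb)
    return {
        "weights": weights_gb,
        "kv_cache": kv_cache_gb,
        "overhead": overhead_gb,
        "total": weights_gb + kv_cache_gb + overhead_gb,
    }


# Inputs from the worked example: 7 B at Q4_K_M, 4 K context at 0.122 GB per 1K tokens.
estimate = total_vram_gb(weights_gb=7 * 0.563, kv_cache_gb=4 * 0.122)
for name, value in estimate.items():
    print(f"{name:9s} {value:5.2f} GB")
```

Returning the components separately makes it easy to see which term dominates as context length grows.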
KV cache: the hidden VRAM cost
At short context lengths (2–4 K tokens), KV cache is a minor factor — typically 5–15 % of total VRAM. But at 32 K or 128 K context, it can exceed the weight footprint for large models, fundamentally changing which hardware tier you need.
The cache size is architecture-dependent, not just parameter-count-dependent. The critical variables are:
- n_layers — more transformer blocks = more KV state per token
- n_kv_heads — Grouped Query Attention (GQA) reduces this vs Multi-Head Attention (MHA); Llama 3.1 8B uses 8 KV heads vs 32 query heads (4× reduction)
- head_dim — typically 64 or 128 in modern models
| Model | Layers | KV heads | Head dim | GB / 1K tokens (FP16) | KV at 32K ctx | Source |
|---|---|---|---|---|---|---|
| Llama 3.2 1B | 16 | 8 (GQA) | 64 | 0.031 GB | 0.99 GB | Meta model card |
| Llama 3.2 3B | 28 | 8 (GQA) | 128 | 0.107 GB | 3.4 GB | Meta model card |
| Llama 3.1 8B | 32 | 8 (GQA) | 128 | 0.122 GB | 3.9 GB | Meta model card |
| Qwen2.5 14B | 48 | 8 (GQA) | 128 | 0.183 GB | 5.9 GB | Alibaba model card |
| Llama 3.1 70B | 80 | 8 (GQA) | 128 | 0.305 GB | 9.8 GB | Meta model card |
| Llama 3.1 405B | 126 | 8 (GQA) | 128 | 0.481 GB | 15.4 GB | Meta model card |
| Llama 2 7B (MHA) | 32 | 32 (MHA) | 128 | 0.488 GB | 15.6 GB | Meta Llama 2 paper |
| Llama 2 13B (MHA) | 40 | 40 (MHA) | 128 | 0.763 GB | 24.4 GB | Meta Llama 2 paper |
MHA models (the two Llama 2 rows) require dramatically more KV cache. Always select the correct architecture type in the calculator above. Per-1K figures are for 1,000 tokens, and the 32K column is 32 × that value. KV cache figures assume FP16 storage; some frameworks support FP8 KV cache, which reduces KV VRAM by ~50 % with minor quality impact (not yet default in llama.cpp as of 2026-06).
Key insight: At 32 K context, a Llama 2 7B model (MHA) needs ~15.6 GB of KV cache alone, which puts total VRAM at about 21.5 GB even at Q4_K_M and well above 24 GB at FP16. The same context on Llama 3.1 8B (GQA) needs only 3.9 GB of KV cache, a 4× reduction that lets the Q4_K_M model run on a 12 GB GPU. This architectural change is why GQA adoption became near-universal after 2023.
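The same comparison can be turned around: given a card, how much context fits? The sketch below inverts the total-VRAM formula under the article's 10 % overhead assumption. `max_context_tokens` is a hypothetical helper, and the result ignores the model's trained context limit, which may be lower.

```python
def max_context_tokens(vram_gb: float, weights_gb: float,
                       kv_gb_per_1k_tokens: float,
                       overhead_fraction: float = 0.10) -> int:
    """Largest context (in tokens) whose total estimate stays within vram_gb."""
    # Total = (weights + kv) * (1 + overhead), so solve for the KV budget.
    kv_budget_gb = vram_gb / (1 + overhead_fraction) - weights_gb
    if kv_budget_gb <= 0:
        return 0  # the weights alone do not fit
    return int(kv_budget_gb / kv_gb_per_1k_tokens * 1000)


# 12 GB card, Q4_K_M weights, KV rates from the reference table.
gqa = max_context_tokens(12, weights_gb=8 * 0.563, kv_gb_per_1k_tokens=0.122)
mha = max_context_tokens(12, weights_gb=7 * 0.563, kv_gb_per_1k_tokens=0.488)
print(f"Llama 3.1 8B (GQA): about {gqa:,} tokens")
print(f"Llama 2 7B (MHA):   about {mha:,} tokens")
```

On the same card the GQA model has room for several times more context, even though its weights are slightly larger.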
GPU VRAM tiers and which models fit
The calculator recommends the minimum tier that holds your total VRAM requirement. More VRAM headroom means longer context, larger batches, and faster inference (fewer or no CPU layer offloads).
| Tier | Example GPU(s) | TDP (mfr spec) | Fits comfortably | Tight / offload needed |
|---|---|---|---|---|
| 6–8 GB | RTX 4060 8 GB | 115 W | ≤ 7B Q4, 3B Q8 | 7B Q4 at >4K context; 8B Q5 |
| 12–16 GB | RTX 4070 12 GB · RTX 4060 Ti 16 GB | 200 W · 165 W | 8B Q8, 13B Q4, 7B FP16 | 13B Q5; 14B Q4 at long ctx |
| 24 GB | RTX 4090 24 GB · RTX 3090 | 450 W · 350 W | 27B Q4, 13B Q8, 8B FP16 | 34B Q4 at short ctx |
| 48 GB | RTX 6000 Ada · A40 | 300 W · 300 W | 70B Q4 (tight), 30B Q8 | 70B Q8; Llama 4 Scout INT4 |
| 80 GB | H100 PCIe · A100 80 GB | 350 W · 400 W | 70B Q5, 70B Q8 (tight) | 70B FP16; 405B Q4 (needs multi-GPU) |
| 160 GB (2×80) | 2× H100 PCIe · 2× A100 | 700 W (pair) | 180B Q4, 70B FP16 | 405B Q4 (needs 4× GPU) |
| Cluster | 4–8× H100 SXM · DGX H100 node | >3 kW per node | 405B Q8, DeepSeek V4 Pro | Contact D-Central for cluster design |
TDP figures from NVIDIA official product pages. Inference typically draws 60–80 % of TDP; peak (FP16 training) hits TDP. “Fits comfortably” assumes 4 K context; longer context shifts boundaries. See Local AI Hardware Guide for model-to-hardware mapping.
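Tier selection is a simple threshold lookup. The sketch below assumes each tier's capacity is the upper bound of its label, which is an assumption about how the calculator draws its boundaries; `TIERS` and `recommend_tier` are illustrative names.

```python
# (capacity in GB, tier label); capacities are the upper bound of each tier (assumed).
TIERS = [
    (8, "6–8 GB"),
    (16, "12–16 GB"),
    (24, "24 GB"),
    (48, "48 GB"),
    (80, "80 GB"),
    (160, "160 GB (2×80)"),
]


def recommend_tier(total_vram_gb: float) -> str:
    """Return the smallest tier whose capacity covers the estimate, else 'Cluster'."""
    for capacity_gb, label in TIERS:
        if total_vram_gb <= capacity_gb:
            return label
    return "Cluster"


# 70 B at Q4_K_M with 4 K context: (39.4 + 1.22) GB plus 10 % overhead.
print(recommend_tier((39.4 + 1.22) * 1.10))
print(recommend_tier(4.87))
```

A card at the low end of a tier (for example 12 GB in the 12–16 GB tier) has less room than the label's upper bound, so compare the estimate against the actual card as well.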
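Apple Silicon note: M1/M2/M3/M4 Macs use unified memory shared between CPU and GPU. A Mac Studio M2 Ultra with 192 GB can run 70B models at Q8 comfortably and at FP16 with less headroom. Its memory bandwidth (~800 GB/s) is in the same range as high-end consumer GPUs such as the RTX 3090 and RTX 4090 (roughly 900–1,000 GB/s), though well below the ~2,000 GB/s of an H100 PCIe, which matters because LLM token generation is memory-bound. llama.cpp and Ollama both support the Metal backend natively.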
Running costs: electricity by Canadian province
Local inference runs 24/7 against your electricity bill. Québec’s hydro rates make it one of the cheapest places on Earth to run a GPU inference hashcenter; Nunavut’s diesel-generation rates make the same workload 10× more expensive. The calculator above uses the approximate values in the table below; for precise current tariffs see /canada-electricity-rates-by-province/.
| Province / Territory | Approx. rate (CAD/kWh) | Monthly cost, RTX 4090 8 h/day | Notes |
|---|---|---|---|
| Québec | $0.071 | ~$7 / mo | Hydro-Québec residential; Block 1 |
| Manitoba | $0.098 | ~$10 / mo | Manitoba Hydro residential |
| British Columbia | $0.130 | ~$13 / mo | BC Hydro Step 2 rate |
| New Brunswick | $0.140 | ~$14 / mo | NB Power residential |
| Newfoundland & Labrador | $0.147 | ~$15 / mo | NL Hydro residential |
| Yukon | $0.161 | ~$16 / mo | Yukon Energy residential |
| Ontario | $0.165 | ~$17 / mo | TOU blended average |
| Alberta | $0.175 | ~$18 / mo | Regulated Rate Option avg; market rate varies |
| Saskatchewan | $0.185 | ~$19 / mo | SaskPower residential |
| Nova Scotia | $0.218 | ~$22 / mo | Nova Scotia Power residential |
| PEI | $0.235 | ~$24 / mo | Maritime Electric residential |
| Northwest Territories | $0.370 | ~$37 / mo | NT Power residential |
| Nunavut | $0.680 | ~$69 / mo | Qulliq Energy, diesel generation |
Monthly cost estimates assume RTX 4090 (350 W load, 35 W idle) running active inference 8 h/day, idle the remaining 16 h/day: (350 W × 8 h + 35 W × 16 h) × 30 days ≈ 101 kWh per month. Running at full load around the clock instead would use about 252 kWh per month, roughly 2.5× these figures. All rates are approximate as of 2026-06; rates change seasonally and by tariff block. Verify at your utility’s website. Full breakdown: /canada-electricity-rates-by-province/.
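The footnote's duty-cycle assumptions can be expressed as a short calculation. This sketch assumes a 30-day month; `monthly_kwh` and `monthly_cost` are hypothetical helpers, and the two rates are the approximate values from the table.

```python
def monthly_kwh(load_watts: float = 350, idle_watts: float = 35,
                active_hours_per_day: float = 8, days: int = 30) -> float:
    """Energy per month for a GPU that is active part of the day and idle the rest."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    idle_hours = 24 - active_hours_per_day
    daily_wh = load_watts * active_hours_per_day + idle_watts * idle_hours
    return daily_wh * days / 1000


def monthly_cost(rate_cad_per_kwh: float, **usage) -> float:
    """Monthly electricity cost in CAD at a flat per-kWh rate."""
    return monthly_kwh(**usage) * rate_cad_per_kwh


# Approximate residential rates from the table (CAD per kWh).
for province, rate in {"Québec": 0.071, "Nunavut": 0.680}.items():
    print(f"{province}: {monthly_kwh():.1f} kWh -> ${monthly_cost(rate):.2f} / month")
```

Passing `active_hours_per_day=24` models a card that runs at load around the clock, which raises the monthly energy from about 101 kWh to 252 kWh.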
Running local AI in Québec costs nearly 10× less than in Nunavut (about 9.6× at the rates above). If you are designing a sovereign AI inference hashcenter rather than a personal workstation, province-level electricity cost can dwarf hardware amortization over a 3-year horizon. See /energy-for-compute/ for a deeper analysis of Canadian AI compute economics and /quebec-hydro-ai-compute/ for Québec-specific feasibility data.
Frequently asked questions
How does quantization affect model output quality?
Quality degradation scales inversely with model size. At Q4_K_M, small models (< 7 B parameters) can show measurable perplexity degradation on reasoning and instruction-following benchmarks; the same quantization on a 70 B model typically retains over 97 % of FP16 quality. Q5_K_M and Q8_0 are near-lossless for all model sizes above 3 B. The practical rule: if you are fitting a small model into a small GPU, consider whether a quantized large model on better hardware would serve you better. Community perplexity comparisons are published in the llama.cpp quantization discussion and TheBloke’s model cards on Hugging Face.
What is KV cache and why does context length multiply my VRAM requirements?
The KV (key-value) cache stores the attention keys and values, the “memory” of every token the model has already processed in the current session. Without it, the model would need to recompute those keys and values for the entire prompt on every new token generated, making inference impractically slow. The cache is stored in GPU memory and grows linearly with context length: doubling your context window doubles the KV cache. At 4 K context it is a minor factor (5–15 % of total VRAM for GQA models); at 128 K context it can exceed the weight footprint for large models. The KV cache is normally stored in FP16 regardless of weight quantization.
Can I run a model that is larger than my VRAM using CPU offloading?
Yes. llama.cpp supports --n-gpu-layers N, which places N of the model’s transformer layers in GPU VRAM and runs the remaining layers on the CPU from system RAM. ExLlamaV2 is GPU-focused: it can split a model across multiple GPUs, but CPU offload is not its intended mode. A model that needs 24 GB can run on a 16 GB GPU + 8 GB of fast DDR5 RAM. The trade-off is speed: system RAM bandwidth (50–80 GB/s for DDR5) is roughly 9× to more than 60× lower than GPU VRAM bandwidth (700–3,350 GB/s depending on GPU generation). Expect token generation to slow by roughly 3–10×, depending on how many layers are offloaded. For interactive chat this is tolerable; for batch inference it is usually not worth the latency.
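A rough starting value for --n-gpu-layers can be estimated by assuming memory is spread evenly across layers. That assumption is a simplification (embeddings and the output layer are not per-layer), the 32-layer count below is chosen purely for illustration, and `gpu_layers_that_fit` is a hypothetical helper.

```python
import math


def gpu_layers_that_fit(total_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 0.0) -> int:
    """Estimate how many layers fit in VRAM, assuming memory is spread evenly across layers."""
    per_layer_gb = total_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# The 24 GB model on a 16 GB card from the answer above, with an assumed 32 layers.
n = gpu_layers_that_fit(total_gb=24, n_layers=32, vram_gb=16)
print(f"--n-gpu-layers {n}")
```

Treat the result as a first guess: start there, watch actual VRAM usage, and lower the value if the load fails or the desktop needs memory (the `reserve_gb` argument leaves room for that).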
My model card says “70 B MoE with 8 B active parameters” — do I need 70 B or 8 B of VRAM?
You need VRAM for the full 70 B parameters. Mixture-of-Experts (MoE) models activate only a subset of their expert feed-forward blocks per token (8 B in this example), but all expert weights must be resident in memory because the router selects different experts for each token dynamically. Llama 4 Scout, for instance, has 109 B total parameters with 17 B active per token, yet it still requires approximately 55 GB at INT4. The “active parameters” figure describes compute throughput and power draw, not memory footprint. See the Local AI Hardware Guide for detailed MoE VRAM requirements.
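The distinction between total and active parameters can be made explicit in a small data structure. This is a sketch: `MoEModel` is a hypothetical class, and 0.5 bytes per weight is plain 4-bit storage with quantization scales ignored, an assumption that matches the roughly 55 GB figure above.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    name: str
    total_params_b: float   # all experts; this is what must sit in memory
    active_params_b: float  # parameters used per token; drives compute, not memory

    def weight_footprint_gb(self, bytes_per_weight: float) -> float:
        """Memory for weights is based on total parameters, never active ones."""
        return self.total_params_b * bytes_per_weight

    def active_fraction(self) -> float:
        """Share of the weights that participate in any single token."""
        return self.active_params_b / self.total_params_b


scout = MoEModel("Llama 4 Scout", total_params_b=109, active_params_b=17)
print(f"{scout.name}: {scout.weight_footprint_gb(0.5):.1f} GB at 4-bit, "
      f"{scout.active_fraction():.0%} of weights active per token")
```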
What is GQA and why does it dramatically change KV cache requirements?
Grouped Query Attention (GQA), introduced in Ainslie et al. (2023), reduces the number of key-value attention heads while keeping the full number of query heads. Llama 3.1 8B has 32 query heads but only 8 KV heads — a 4× reduction in KV cache size compared to standard Multi-Head Attention (MHA). Llama 2 7B (MHA) needs 0.488 GB of KV cache per 1 K context tokens; Llama 3.1 8B (GQA) needs only 0.122 GB — a 4× improvement despite being a similar-size model. GQA became the standard architecture for virtually all open-weight models released after mid-2023. When using the calculator, selecting “MHA” for a Llama 3 model will significantly over-estimate KV cache.
How do I measure actual VRAM usage when running a model locally?
The most direct method on NVIDIA GPUs is watch -n1 nvidia-smi in a terminal: it refreshes usage stats every second, showing per-process VRAM allocation. When you start llama.cpp with --verbose, it prints a detailed breakdown of VRAM allocation (model weights, KV cache, scratch buffers) at startup. On Apple Silicon, use Activity Monitor: the Memory tab shows overall memory pressure, and Window → GPU History shows GPU activity. For AMD ROCm GPUs: rocm-smi or radeontop. For Ollama, ollama ps lists the loaded models and their memory size (the server at http://localhost:11434 is an API endpoint, not a dashboard). Actual usage will typically be within 5–15 % of this calculator’s estimates; the flat 10 % overhead factor is a simplification.
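For a scripted check on NVIDIA hardware, nvidia-smi can emit machine-readable output through its --query-gpu and --format options. The sketch below parses that output and compares it with an estimate; the function names are hypothetical, and the conversion assumes nvidia-smi reports MiB while the estimate is in decimal GB.

```python
import subprocess


def parse_memory_csv(text: str) -> list:
    """Parse 'used, total' rows (MiB) as printed with csv,noheader,nounits."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory() -> list:
    """Return (used_mib, total_mib) for each GPU; requires nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


def percent_difference(estimate_gb: float, used_mib: int) -> float:
    """Signed difference between measured and estimated usage, in percent."""
    used_gb = used_mib * 2**20 / 1e9
    return (used_gb - estimate_gb) / estimate_gb * 100
```

With a model loaded, calling `read_gpu_memory()` and passing the first GPU's used value to `percent_difference` shows how far the estimate is from reality. Measured usage includes anything else on the GPU, such as the desktop compositor.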
Related tools and guides on D-Central
- Local AI Hardware Guide — model-to-hardware mapping table: which GPU tier or Apple Silicon config fits every major open-weight model from 1 B to 405 B
- Running Local LLMs in Canada — privacy law, practical setup, and Canadian hardware sourcing guide
- Canada Electricity Rates by Province — full provincial rate tables with utility sources and commercial tier data
- Energy for Compute — economics of AI inference power: how to evaluate total cost of ownership for a local GPU inference deployment
- Distributed Compute — multi-node inference and GPU cluster design for models that exceed single-machine VRAM capacity
- Sovereign AI in Canada — strategic guide to Canadian digital sovereignty in the AI era
- AI Sovereignty Consulting — D-Central advises Canadian organizations on private GPU inference infrastructure design
- Mining Profitability Calculator — same electricity inputs, for Bitcoin mining ASIC workloads
- Power Cost Calculator — watt-hour to dollar converter with Canadian provincial rates
Last reviewed June 18, 2026.
