GPU vs Local LLM Compatibility: Which Models Can Your GPU Run?

Cross-reference 30 GPUs against 63 open local LLMs to see exactly which models fit in VRAM at Q4, Q8 and FP16.

Quick answer

This dataset cross-references 30 GPUs against 63 open local LLMs (1,890 fit records). For each pair it computes whether the model fits in VRAM at Q4, Q8 and FP16, and reports the headroom and a recommended quantization. Use it to answer "which local LLM can my GPU run?" before you buy hardware or pull a model.

The dataset now carries two fit columns per pair: weights-only (q4/q8_fits) and a KV-cache-aware fits_at_context that adds the FP16 attention cache at a 32,768-token stress context (capped to each model's native window) plus a ~1 GB runtime margin. A card can pass the weights-only check yet fail at real context. For example, a 24 GB GPU holds Qwen3-32B's Q4 weights (20 GB), but adding 32k tokens of KV cache raises the requirement to roughly 29 GB, so the model runs out of memory. Check fits_at_context before you provision. The data is free to download (CSV/JSON) and to query via API under CC BY 4.0.
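The difference between the two fit columns can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's actual code: the function names are hypothetical, and the 8 GB KV-cache figure is an assumption inferred from the numbers quoted above (about 29 GB total, minus 20 GB of weights and a ~1 GB margin).

```python
def fits_weights_only(vram_gb: float, weights_gb: float) -> bool:
    """Weights-only check: VRAM must be at least the weight footprint."""
    return vram_gb >= weights_gb


def fits_at_context(vram_gb: float, weights_gb: float,
                    kv_cache_gb: float, runtime_margin_gb: float = 1.0) -> bool:
    """Context-aware check: weights + KV cache + runtime margin must fit."""
    return vram_gb >= weights_gb + kv_cache_gb + runtime_margin_gb


vram_gb = 24.0        # e.g. a 24 GB card
weights_q4_gb = 20.0  # Q4 weights quoted in the text above
kv_32k_gb = 8.0       # assumed: implied by the ~29 GB total quoted above

print(fits_weights_only(vram_gb, weights_q4_gb))           # True
print(fits_at_context(vram_gb, weights_q4_gb, kv_32k_gb))  # False
```

The same pair passes the first check and fails the second, which is why the context-aware column is the safer one to consult before provisioning.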


GPU	VRAM	Models fit at Q4	Models fit at Q8	Models fit at FP16	Largest runnable (weights only)	Recommended home model (Q8)
Apple M4 Max (40-core GPU, up to 128 GB)	128 GB	53 / 63	50	43	Qwen3.5 122B-A10B (MoE)	Command R+ 104B
Apple M3 Max (40-core GPU, up to 128 GB)	128 GB	53 / 63	50	43	Qwen3.5 122B-A10B (MoE)	Command R+ 104B
NVIDIA H100 SXM 80 GB	80 GB	52 / 63	47	41	gpt-oss-120b	Llama 3.3 Nemotron Super 49B v1.5
NVIDIA A100 SXM4 80 GB	80 GB	52 / 63	47	41	gpt-oss-120b	Llama 3.3 Nemotron Super 49B v1.5
Apple M4 Pro (20-core GPU, up to 64 GB)	64 GB	51 / 63	43	32	GLM-4.5-Air	Llama 3.3 Nemotron Super 49B v1.5
NVIDIA L40S 48 GB	48 GB	49 / 63	42	25	Hunyuan-A13B-Instruct	Qwen3.5 35B-A3B (MoE)
NVIDIA RTX A6000 (Ampere) 48 GB	48 GB	49 / 63	42	25	Hunyuan-A13B-Instruct	Qwen3.5 35B-A3B (MoE)
NVIDIA GeForce RTX 5090	32 GB	44 / 63	31	24	Llama 3.3 Nemotron Super 49B v1.5	Gemma 3 27B Instruct
NVIDIA L4 24 GB	24 GB	42 / 63	25	16	Qwen3.5 35B-A3B (MoE)	Phi-4 14B Instruct
NVIDIA GeForce RTX 4090	24 GB	42 / 63	25	16	Qwen3.5 35B-A3B (MoE)	Phi-4 14B Instruct
NVIDIA GeForce RTX 3090	24 GB	42 / 63	25	16	Qwen3.5 35B-A3B (MoE)	Phi-4 14B Instruct
AMD Radeon RX 7900 XTX	24 GB	42 / 63	25	16	Qwen3.5 35B-A3B (MoE)	Phi-4 14B Instruct
AMD Radeon RX 7900 XT	20 GB	39 / 63	24	15	Command R 35B (08-2024)	Phi-4 14B Instruct
NVIDIA GeForce RTX 5080	16 GB	26 / 63	24	10	gpt-oss-20b	OLMo 2 13B Instruct
NVIDIA GeForce RTX 4080 Super	16 GB	26 / 63	24	10	gpt-oss-20b	OLMo 2 13B Instruct
NVIDIA GeForce RTX 4080	16 GB	26 / 63	24	10	gpt-oss-20b	OLMo 2 13B Instruct
NVIDIA GeForce RTX 4070 Ti Super	16 GB	26 / 63	24	10	gpt-oss-20b	OLMo 2 13B Instruct
AMD Radeon RX 7800 XT	16 GB	26 / 63	24	10	gpt-oss-20b	OLMo 2 13B Instruct
Intel Arc A770 16 GB	16 GB	26 / 63	24	10	gpt-oss-20b	OLMo 2 13B Instruct
NVIDIA GeForce RTX 4060 Ti 16 GB	16 GB	26 / 63	24	10	gpt-oss-20b	OLMo 2 13B Instruct
AMD Radeon RX 6900 XT	16 GB	26 / 63	24	10	gpt-oss-20b	OLMo 2 13B Instruct
NVIDIA GeForce RTX 4070 Super	12 GB	25 / 63	16	6	Phi-4 14B Instruct	NVIDIA Nemotron Nano 9B v2
AMD Radeon RX 7700 XT	12 GB	25 / 63	16	6	Phi-4 14B Instruct	NVIDIA Nemotron Nano 9B v2
NVIDIA GeForce RTX 4070	12 GB	25 / 63	16	6	Phi-4 14B Instruct	NVIDIA Nemotron Nano 9B v2
NVIDIA GeForce RTX 3080 12 GB	12 GB	25 / 63	16	6	Phi-4 14B Instruct	NVIDIA Nemotron Nano 9B v2
Intel Arc B580 12 GB	12 GB	25 / 63	16	6	Phi-4 14B Instruct	NVIDIA Nemotron Nano 9B v2
NVIDIA GeForce RTX 4060 Ti 8 GB	8 GB	19 / 63	10	5	Gemma 3 12B Instruct	Gemma 3 4B Instruct
NVIDIA GeForce RTX 4060	8 GB	19 / 63	10	5	Gemma 3 12B Instruct	Gemma 3 4B Instruct
AMD Radeon RX 7600	8 GB	19 / 63	10	5	Gemma 3 12B Instruct	Gemma 3 4B Instruct
NVIDIA GeForce RTX 3070	8 GB	19 / 63	10	5	Gemma 3 12B Instruct	Gemma 3 4B Instruct

Method: A model "fits" (weights-only) when the GPU's VRAM is at least the model's weight footprint at that quantization. A quality grade describes the headroom as the ratio of VRAM to weights: tight (below 1.15×, enough for the weights but little room for context), good (1.15× to below 1.6×), ample (1.6× or more). The dataset's fits_at_context columns go further: they add the FP16 KV cache at a 32,768-token stress context (capped to each model's native window) plus a ~1 GB runtime margin. The KV cache is computed from each model's published architecture as 2 × layers × kv_heads × head_dim × 2 bytes per token, where the leading 2 covers keys and values and the trailing 2 bytes is the FP16 element size. Models whose attention geometry is unverified or non-standard are flagged unknown-architecture and report a null KV value rather than a guess. Source data: the AI-GPU database & local-LLM model database.
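As a rough sketch of the method, the KV-cache formula and the headroom grades might be written as below. The function names, the "does not fit" label, and the example architecture (64 layers, 8 KV heads, head dimension 128, a 40,960-token native window) are illustrative assumptions rather than values taken from the dataset. The sketch also assumes 1 GB means 2**30 bytes, which may differ from the dataset's own convention.

```python
GIB = 2**30  # assumption: 1 GB is treated as 2**30 bytes in this sketch


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """2 (keys and values) x layers x kv_heads x head_dim x element size."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                native_window: int, stress_context: int = 32_768) -> float:
    """FP16 KV cache at the stress context, capped to the native window."""
    tokens = min(stress_context, native_window)
    return kv_bytes_per_token(layers, kv_heads, head_dim) * tokens / GIB


def headroom_grade(vram_gb: float, weights_gb: float) -> str:
    """Grade the VRAM-to-weights ratio using the thresholds in the text."""
    if vram_gb < weights_gb:
        return "does not fit"  # hypothetical label for the failing case
    ratio = vram_gb / weights_gb
    if ratio < 1.15:
        return "tight"
    if ratio < 1.6:
        return "good"
    return "ample"


# Illustrative architecture values (assumed, not read from the dataset).
print(kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128))  # 262144
print(kv_cache_gb(64, 8, 128, native_window=40_960))            # 8.0
print(headroom_grade(24.0, 20.0))                               # good
```

With these assumed values the cache costs 256 KiB per token, so a 32,768-token context adds 8 GB on top of the weights. A pair graded "good" on weights alone (24 GB of VRAM against 20 GB of weights) can therefore still fail once context is included.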


Last reviewed June 18, 2026.