AI GPU Database: VRAM, TFLOPS, TDP for Local LLM Inference
D-Central’s AI GPU Database is a free, citable reference of 30 GPU and accelerator records — the numbers that actually matter for running a large language model locally: VRAM GB, memory bandwidth, FP16 TFLOPS, INT8 TOPS, TDP, and a plain-English inference tier. Data is sourced directly from NVIDIA, AMD, Apple, and Intel official product pages and datasheets, cross-referenced with TechPowerUp GPU Database. Published under CC BY 4.0; verify at source before making purchasing decisions.
Key insight for local AI: For single-stream LLM token generation, memory bandwidth is the dominant bottleneck — not compute TFLOPS. A card with more GB/s will generate tokens faster even if its FP16 number looks lower. VRAM size determines which models fit without CPU offload. FP16/INT8 TFLOPS matter primarily for batch inference and prompt-processing speed.
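As a rough illustration of the bandwidth argument above, the sketch below computes an upper bound on single-stream decode speed by dividing memory bandwidth by the size of the weights. It assumes every weight is read exactly once per generated token and ignores KV-cache traffic, kernel overhead, and compute limits, so real throughput will be lower. The function name and the 7B, 4-bit example are illustrative choices, not part of the dataset.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billion: float,
                          bytes_per_param: float) -> float:
    """Bandwidth-only ceiling on decode speed (assumes one full weight read per token)."""
    # billions of parameters x bytes per parameter = gigabytes of weights
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb


# Example: a 7B model at ~0.5 bytes/parameter (4-bit), i.e. ~3.5 GB of weights.
rtx_4090 = max_tokens_per_second(1008, 7, 0.5)    # 1,008 GB/s from the table
rtx_4060_ti = max_tokens_per_second(288, 7, 0.5)  # 288 GB/s from the table

print(round(rtx_4090), round(rtx_4060_ti))  # prints: 288 82
```

The ratio between the two ceilings equals the ratio of the bandwidth figures (3.5×), which is why the table treats GB/s as the headline number for token generation.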
Inference tier guide
- Tier 1 Data Center — enterprise / cloud only; not home-deployable
- Tier 2 Prosumer — serious home AI server; runs 70B+ models quantised
- Tier 3 Capable — 13B–34B models comfortably; strong home use
- Tier 4 Mid-Range — 7B–13B models; reasonable home use
- Tier 5 Entry — 7B limit; VRAM or bandwidth-constrained
GPU and accelerator specifications
| GPU / Accelerator | Manufacturer | VRAM GB | VRAM Type | Bandwidth GB/s | FP16 TFLOPS | INT8 TOPS | TDP W | Tier |
|---|---|---|---|---|---|---|---|---|
| NVIDIA H100 SXM 80 GB · Hopper (GH100) · HBM3 | NVIDIA | 80 | HBM3 | 3,350 | 1,979 | 3,958 | 700 | 1 |
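| Frontier inference GPU. FP16 and INT8 figures are the tensor-core values from the official NVIDIA datasheet, which lists them with sparsity; dense throughput is roughly half. Cloud/enterprise only. Credit: NVIDIA Corporation — nvidia.com/en-us/data-center/h100/ | ||||||||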
| NVIDIA A100 SXM4 80 GB · Ampere DC (GA100) · HBM2e | NVIDIA | 80 | HBM2e | 2,000 | 312 | 624 | 400 | 1 |
| FP16 tensor-core dense, NVIDIA A100 datasheet. Available used through secondary market. Credit: NVIDIA Corporation — nvidia.com/en-us/data-center/a100/ | ||||||||
| NVIDIA L40S 48 GB · Ada Lovelace (AD102) · GDDR6 | NVIDIA | 48 | GDDR6 | 864 | ~183 ⓘ | ~366 ⓘ | 350 | 1 |
| 48 GB fits 70B at ~4-bit quantisation (~35 GB of weights); 70B at FP16 (~140 GB) does not fit. FP16/INT8 = dense tensor estimates; NVIDIA datasheets list the sparse figure (~2× dense). Credit: NVIDIA Corporation — nvidia.com/en-us/data-center/l40s/ | ||||||||
| NVIDIA L4 24 GB · Ada Lovelace (AD104) · GDDR6 | NVIDIA | 24 | GDDR6 | 300 | ~121 ⓘ | 242 | 72 | 2 |
| Efficiency standout: 72 W TDP, low-profile PCIe, 24 GB. INT8 = 242 TOPS confirmed from NVIDIA. FP16 derived as INT8/2. Credit: NVIDIA Corporation — nvidia.com/en-us/data-center/l4/ | ||||||||
| NVIDIA RTX A6000 (Ampere) 48 GB · Ampere (GA102) · GDDR6 | NVIDIA | 48 | GDDR6 | 768 | 155 | ~310 ⓘ | 300 | 2 |
| FP16 = 154.83 TFLOPS tensor-core dense, NVIDIA official datasheet. 48 GB + NVLink (96 GB dual-card). Available used. Credit: NVIDIA Corporation — nvidia.com/…/rtx-a6000/ | ||||||||
| NVIDIA GeForce RTX 5090 · Blackwell (GB202) · GDDR7 | NVIDIA | 32 | GDDR7 | 1,792 | ~210 ⓘ | — | 575 | 2 |
| Highest-bandwidth consumer GPU (1,792 GB/s, GDDR7). Launched Jan 2025. FP16 = shader estimate. NVIDIA does not publish GeForce tensor TFLOPS. Credit: NVIDIA Corporation — nvidia.com/…/rtx-5090/ | ||||||||
| NVIDIA GeForce RTX 4090 · Ada Lovelace (AD102) · GDDR6X | NVIDIA | 24 | GDDR6X | 1,008 | 165.2 | 330.3 | 450 | 2 |
| FP16 165.2 and INT8 330.3 TOPS = published by NVIDIA (shader FP16 = 2×82.6 FP32). Gold-standard home inference. Credit: NVIDIA Corporation — nvidia.com/…/rtx-4090/ | ||||||||
| Apple M4 Max (40-core GPU, up to 128 GB) · Apple Silicon M4 · Unified | Apple | 128 ‡ | Unified | 546 | ~18.4 ⓘ | 38 ‡ | — ‡ | 2 |
| ‡ VRAM = unified memory (no GDDR bus). INT8 TOPS = 16-core Neural Engine (separate from GPU). TDP not published per chip. FP16 = GPU shader third-party estimate. 128 GB fits 70B at 8-bit (~70 GB of weights); 70B at FP16 (~140 GB) does not fit. Credit: Apple Inc. — apple.com newsroom | ||||||||
| NVIDIA GeForce RTX 3090 · Ampere (GA102) · GDDR6X | NVIDIA | 24 | GDDR6X | 936 | ~71 ⓘ | ~142 ⓘ | 350 | 3 |
| Best-value used card for 24 GB. FP16 confidence: moderate (sources diverge between 71 and 142; we use 71.16 = 2×35.58 FP32, the lower, conservative value). Credit: NVIDIA Corporation. | ||||||||
| NVIDIA GeForce RTX 5080 · Blackwell (GB203) · GDDR7 | NVIDIA | 16 | GDDR7 | 960 | 112.6 | — | 360 | 3 |
| FP16 112.6 TFLOPS = consistently cited from multiple sources; Blackwell launched Jan 2025. Credit: NVIDIA Corporation — nvidia.com/…/rtx-5080/ | ||||||||
| NVIDIA GeForce RTX 4080 Super · Ada Lovelace (AD103) · GDDR6X | NVIDIA | 16 | GDDR6X | 736 | ~104 ⓘ | — | 320 | 3 |
| Credit: NVIDIA Corporation — nvidia.com/…/rtx-4080-family/ | ||||||||
| NVIDIA GeForce RTX 4080 · Ada Lovelace (AD103) · GDDR6X | NVIDIA | 16 | GDDR6X | 717 | ~97.5 ⓘ | — | 320 | 3 |
| FP32 49 TFLOPS confirmed by NVIDIA (“49 Shader-TFLOPs”). FP16 = 2×. Credit: NVIDIA Corporation. | ||||||||
| AMD Radeon RX 7900 XTX · RDNA 3 (Navi 31) · GDDR6 | AMD | 24 | GDDR6 | 960 | 123 | 123 | 355 | 3 |
| FP16 Matrix (AI Accelerator) = 123 TFLOPS & INT8 Matrix = 123 TOPS per AMD official product page (RDNA 3 delivers same throughput for FP16 and INT8 matrix). Inference via ROCm (Linux) or Vulkan/DirectML. Credit: AMD — amd.com RX 7900 XTX | ||||||||
| Apple M3 Max (40-core GPU, up to 128 GB) · Apple Silicon M3 · Unified | Apple | 128 ‡ | Unified | 400 | ~16.4 ⓘ | 18 ‡ | — ‡ | 3 |
| ‡ Unified memory / Neural Engine — same caveats as M4 Max. 128 GB enables 70B Q4 locally. Credit: Apple Inc. — apple.com M3 newsroom | ||||||||
| NVIDIA GeForce RTX 4070 Ti Super · Ada Lovelace (AD103) · GDDR6X | NVIDIA | 16 | GDDR6X | 672 | ~88 ⓘ | — | 285 | 3 |
| Credit: NVIDIA Corporation — nvidia.com/…/rtx-4070-ti-super/ | ||||||||
| AMD Radeon RX 7900 XT · RDNA 3 (Navi 31) · GDDR6 | AMD | 20 | GDDR6 | 800 | 103 | 103 | 315 | 3 |
| FP16 Matrix = 103 TFLOPS, INT8 Matrix = 103 TOPS per WareDB (sourced from AMD official). 20 GB sweet spot. Credit: AMD — amd.com RX 7900 XT | ||||||||
| Apple M4 Pro (20-core GPU, up to 64 GB) · Apple Silicon M4 · Unified | Apple | 64 ‡ | Unified | 273 | ~9.2 ⓘ | 38 ‡ | — ‡ | 3 |
| ‡ Same Apple caveats. FP16 = LOW confidence estimate (half of M4 Max). 64 GB unified memory. Credit: Apple Inc. — Apple tech specs MBP M4 Pro | ||||||||
| NVIDIA GeForce RTX 4070 Super · Ada Lovelace (AD104) · GDDR6X | NVIDIA | 12 | GDDR6X | 504 | ~71 ⓘ | — | 220 | 4 |
| Credit: NVIDIA Corporation — nvidia.com/…/rtx-4070-super/ | ||||||||
| AMD Radeon RX 7800 XT · RDNA 3 (Navi 32) · GDDR6 | AMD | 16 | GDDR6 | 576 | 74.6 | 74.6 | 263 | 4 |
| FP16 Matrix = 74.6 & INT8 Matrix = 74.6 TOPS per AMD official product page. 16 GB mid-range. Credit: AMD — amd.com RX 7800 XT | ||||||||
| Intel Arc A770 16 GB · Xe-HPG Alchemist (ACM-G10) · GDDR6 | Intel | 16 | GDDR6 | 560 | 39.4 | — | 225 | 4 |
| FP16 39.4 TFLOPS = published by Intel official product page. 16 GB GDDR6, XMX AI acceleration. INT8 not published for consumer Arc. Credit: Intel Corporation — intel.com Arc A770 specs | ||||||||
| AMD Radeon RX 7700 XT · RDNA 3 (Navi 32) · GDDR6 | AMD | 12 | GDDR6 | 432 | 70.3 | 70.3 | 245 | 4 |
| FP16 Matrix = 70.3 TFLOPS per AMD official spec page (108 AI Accelerators). Credit: AMD — amd.com RX 7700 XT | ||||||||
| NVIDIA GeForce RTX 4070 · Ada Lovelace (AD104) · GDDR6X | NVIDIA | 12 | GDDR6X | 504 | ~58.5 ⓘ | — | 200 | 4 |
| Credit: NVIDIA Corporation — nvidia.com/…/rtx-4070/ | ||||||||
| NVIDIA GeForce RTX 4060 Ti 16 GB · Ada Lovelace (AD106) · GDDR6 | NVIDIA | 16 | GDDR6 | 288 | ~44 ⓘ | — | 165 | 4 |
| 16 GB is the main advantage; 288 GB/s bandwidth is the bottleneck (~3.5× less than the RTX 4090's 1,008 GB/s, so token generation is correspondingly slower). Credit: NVIDIA Corporation. | ||||||||
| NVIDIA GeForce RTX 3080 12 GB · Ampere (GA102) · GDDR6X | NVIDIA | 12 | GDDR6X | 912 | ~61 ⓘ | — | 350 | 4 |
| 912 GB/s bandwidth makes this a fast token generator despite “only” 12 GB. Good used value. Credit: NVIDIA Corporation. | ||||||||
| AMD Radeon RX 6900 XT · RDNA 2 (Navi 21) · GDDR6 | AMD | 16 | GDDR6 | 512 | ~46 ⓘ | — | 300 | 4 |
| RDNA 2 — no dedicated AI Accelerators. FP16 = shader estimate. 16 GB at good used prices. ROCm support for older gen. Credit: AMD — amd.com RX 6900 XT | ||||||||
| Intel Arc B580 12 GB · Xe2-HPG Battlemage (BMG-G21) · GDDR6 | Intel | 12 | GDDR6 | 456 | 27.3 | — | 190 | 4 |
| FP16 27.34 TFLOPS = Intel official product page. $249 launch (Dec 2024). 160 XMX engines for matrix acceleration. INT8 not published. Credit: Intel Corporation — intel.com Arc B580 specs | ||||||||
| NVIDIA GeForce RTX 4060 Ti 8 GB · Ada Lovelace (AD106) · GDDR6 | NVIDIA | 8 | GDDR6 | 288 | ~44 ⓘ | — | 160 | 5 |
| Same compute as 16 GB variant. 8 GB VRAM is the binding AI constraint. Prefer 16 GB for AI work. Credit: NVIDIA Corporation. | ||||||||
| NVIDIA GeForce RTX 4060 · Ada Lovelace (AD107) · GDDR6 | NVIDIA | 8 | GDDR6 | 272 | ~30 ⓘ | — | 115 | 5 |
| 115 W TDP standout. 8 GB limits to 7B Q4. Good entry point. Credit: NVIDIA Corporation — nvidia.com/…/rtx-4060/ | ||||||||
| AMD Radeon RX 7600 · RDNA 3 (Navi 33) · GDDR6 | AMD | 8 | GDDR6 | 288 | ~42.8 ⓘ | ~42.8 ⓘ | 165 | 5 |
| 8 GB RDNA 3 entry. 64 AI Accelerators. FP16/INT8 sourced from RDNA 3 architecture data via gpupoet (not directly from amd.com product page — moderate confidence). Credit: AMD — amd.com RX 7600 | ||||||||
| NVIDIA GeForce RTX 3070 · Ampere (GA104) · GDDR6 | NVIDIA | 8 | GDDR6 | 448 | ~40.6 ⓘ | — | 220 | 5 |
| Widely available used at low prices. 8 GB limits to 7B Q4. Better bandwidth than RTX 4060 8 GB despite same VRAM class. Credit: NVIDIA Corporation. | ||||||||
ⓘ = estimated value (not directly confirmed from official spec page). ‡ = Apple Silicon caveat applies (unified memory / Neural Engine / TDP not published per chip). See per-row notes and the Methodology page. For machine-readable access, use the REST API.
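One way to use the table is to shortlist cards that meet a VRAM floor and then rank them by bandwidth. The sketch below does that with a few rows copied by hand from the table. The `Gpu` record and `shortlist` helper are hypothetical names for illustration and are not the schema of the dataset's CSV/JSON or REST API.

```python
from typing import NamedTuple


class Gpu(NamedTuple):
    name: str
    vram_gb: int
    bandwidth_gb_s: int
    tdp_w: int


# A hand-copied subset of the table above (VRAM GB, Bandwidth GB/s, TDP W).
ROWS = [
    Gpu("RTX 4090", 24, 1008, 450),
    Gpu("RTX 3090", 24, 936, 350),
    Gpu("RX 7900 XTX", 24, 960, 355),
    Gpu("RTX 4060 Ti 16 GB", 16, 288, 165),
    Gpu("RTX 3080 12 GB", 12, 912, 350),
    Gpu("L4", 24, 300, 72),
]


def shortlist(rows, min_vram_gb):
    """Keep cards with at least min_vram_gb, fastest memory first."""
    fits = [g for g in rows if g.vram_gb >= min_vram_gb]
    return sorted(fits, key=lambda g: g.bandwidth_gb_s, reverse=True)


for gpu in shortlist(ROWS, 24):
    per_watt = gpu.bandwidth_gb_s / gpu.tdp_w
    print(f"{gpu.name}: {gpu.bandwidth_gb_s} GB/s, {per_watt:.2f} GB/s per W")
```

With a 24 GB floor this lists the RTX 4090, RX 7900 XTX, RTX 3090, and L4 in that order. The GB/s-per-watt column shows why the L4 is described as an efficiency standout despite its low absolute bandwidth.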
Frequently asked questions
- Which GPU is best for running local LLMs at home?
- For most home users, the NVIDIA GeForce RTX 4090 (24 GB, 1,008 GB/s) is the best single-card choice: it has the VRAM to run 34B models at 4-bit quantisation (about 17 GB of weights) and the bandwidth to generate tokens quickly. If budget is the constraint, the RTX 3090 offers the same 24 GB of VRAM and nearly the same bandwidth (936 GB/s) at a much lower used price. Apple Silicon (M4 Max with 128 GB unified memory) can hold a 70B model at 8-bit precision (about 70 GB of weights) in a single machine, which no consumer discrete GPU in this table can do; 70B at full FP16 (about 140 GB) exceeds even 128 GB.
- Why does memory bandwidth matter more than compute TFLOPS for local AI?
- Modern LLM inference in single-stream mode is memory-bandwidth-bound, not compute-bound. The GPU must stream all of the model's weights from VRAM for every generated token. A card with more GB/s will generate tokens faster even if its FP16 number looks lower. Compute TFLOPS matter more for long-context prompt processing (prefill) and batch inference, where the arithmetic intensity rises.
- Can I use an AMD GPU for local AI?
- Yes. AMD RDNA 3 GPUs work with ROCm on Linux for frameworks like PyTorch and llama.cpp (ROCm/HIP backend). On Windows, Vulkan and DirectML backends are available via llama.cpp and Ollama. The RX 7900 XTX (24 GB) and RX 7900 XT (20 GB) are competitive with NVIDIA options for inference throughput, and their AI Accelerators deliver the same matrix throughput at FP16 and INT8. ROCm tooling is improving but is not yet as seamless as NVIDIA’s CUDA.
- What about Intel Arc GPUs?
- Intel Arc GPUs (A770 16 GB, B580 12 GB) are a viable budget option, especially for the 16 GB A770. They use Intel’s XMX matrix engines for FP16/INT8 acceleration and are supported via IPEX-LLM, OpenVINO, and llama.cpp’s SYCL backend. CUDA is not available. Ecosystem maturity is lower than NVIDIA or AMD ROCm.
- What does “INT8 TOPS” mean and why is it missing for many cards?
- INT8 TOPS (tera-operations per second at 8-bit integer precision) measures how fast a GPU can run quantised model inference. NVIDIA does not publish INT8 TOPS for GeForce consumer cards (it appears only on pro/data-center datasheets). Where shown for consumer NVIDIA cards, the figure is estimated. AMD does publish AI Accelerator INT8 TOPS on its product pages for RDNA 3 — notably, RDNA 3 delivers the same throughput for FP16 and INT8 matrix operations.
- What is unified memory (Apple) and how does it compare to GPU VRAM?
- Apple Silicon uses a unified memory architecture where the same physical memory pool is shared between CPU, GPU, and Neural Engine with no bus-transfer overhead. “VRAM GB” for Apple entries in this table reflects the maximum unified-memory configuration, not a discrete VRAM pool. This means an M4 Max with 128 GB can give most of that pool to model weights if needed, far more than any consumer discrete GPU offers. The trade-off is that memory bandwidth (546 GB/s for M4 Max vs. 1,008 GB/s for RTX 4090) is lower, so token generation on the same model is slower, roughly in proportion to the bandwidth gap.
- What is the smallest GPU that can run a 13B model?
- A 13B parameter model in FP16 requires approximately 26 GB of VRAM (13B × 2 bytes) for the weights alone. At Q4 quantisation (~0.5 bytes/parameter), the weights fit in about 6.5 GB. So a card with 8 GB VRAM (RTX 4060, RX 7600) can run a 13B model at Q4, though with little room left for context, and not at FP16. A 16 GB card (RTX 4080, RX 7800 XT) handles 13B at 8-bit (about 13 GB) or Q4 comfortably. Even a 24 GB card (RTX 4090, RX 7900 XTX) cannot hold 13B at FP16 without offloading part of the model, because 26 GB exceeds 24 GB.
- How often is this dataset updated?
- The dataset was last verified in June 2026 against official manufacturer spec pages. GPU specifications are stable once a card is released, but new models are released regularly. We aim to add new entries and correct any discrepancies as they are identified. Check the last_verified field in the CSV/JSON and verify key specifications at the manufacturer’s source before purchasing.
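The sizing arithmetic used in the 13B answer (parameters × bytes per parameter) can be sketched as a small helper. The 2 bytes for FP16 and ~0.5 bytes for Q4 come from the text above; the 1 byte for 8-bit and the 1 GB headroom for KV cache and runtime buffers are assumptions for illustration, and real quantisation formats and context lengths will shift the totals.

```python
# Approximate bytes per parameter. FP16 and Q4 follow the FAQ above;
# Q8 = 1.0 is an assumption for plain 8-bit weights.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Size of the weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, vram_gb: float,
         headroom_gb: float = 1.0) -> bool:
    """True if weights plus an assumed headroom fit in vram_gb without offload."""
    return weights_gb(params_billion, precision) + headroom_gb <= vram_gb


print(weights_gb(13, "FP16"))   # 26.0
print(fits(13, "Q4", 8))        # True  (6.5 GB + 1 GB headroom)
print(fits(13, "FP16", 24))     # False (26 GB exceeds 24 GB)
print(fits(70, "Q8", 128))      # True  (70 GB + 1 GB headroom)
```

Raising `headroom_gb` is a simple way to model longer contexts: with 2 GB of headroom, for instance, a 13B Q4 model no longer passes the check on an 8 GB card.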
Related datasets and tools
- Bench Measurements — D-Central’s first-hand ASIC performance measurements from our Laval bench
- Open Data Catalog — all D-Central open datasets with API endpoints
- Running Local LLMs in Canada — guide to setting up private AI inference on your own hardware
- AI & Bitcoin Mining Convergence — D-Central’s AI vertical
- Distributed Compute — sovereign AI infrastructure for Canadian businesses
- Bitcoin Mining Field Manual — technical guides for ASIC deployment and hashcenter operations
Cite this dataset
This dataset is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence. You are free to share, adapt, and use this data for any purpose, including commercially, as long as you give appropriate credit.
APA
D-Central Technologies. (2026). AI & Local-Inference GPU Database (v1.0) [Dataset]. https://d-central.tech/data/ai-gpu-database/. CC BY 4.0.
Chicago
D-Central Technologies. “AI & Local-Inference GPU Database.” Version 1.0. Dataset. 2026. https://d-central.tech/data/ai-gpu-database/. CC BY 4.0.
BibTeX
@misc{dcentral2026gpudb,
author = {{D-Central Technologies}},
title = {AI \& Local-Inference {GPU} Database},
year = {2026},
version = {1.0},
howpublished = {\url{https://d-central.tech/data/ai-gpu-database/}},
note = {CC BY 4.0}
}
Specifications sourced from NVIDIA Corporation, AMD (Advanced Micro Devices), Apple Inc., and Intel Corporation official product pages and datasheets; cross-referenced with TechPowerUp GPU Database and WareDB. As of June 2026 — verify at source before purchasing decisions. Not financial or purchasing advice.
Last reviewed June 15, 2026.
