AI GPU Database: VRAM, TFLOPS, TDP for Local LLM Inference

D-Central’s AI GPU Database is a free, citable reference of 30 GPU and accelerator records — the numbers that actually matter for running a large language model locally: VRAM GB, memory bandwidth, FP16 TFLOPS, INT8 TOPS, TDP, and a plain-English inference tier. Data is sourced directly from NVIDIA, AMD, Apple, and Intel official product pages and datasheets, cross-referenced with TechPowerUp GPU Database. Published under CC BY 4.0; verify at source before making purchasing decisions.

Key insight for local AI: For single-stream LLM token generation, memory bandwidth is the dominant bottleneck — not compute TFLOPS. A card with more GB/s will generate tokens faster even if its FP16 number looks lower. VRAM size determines which models fit without CPU offload. FP16/INT8 TFLOPS matter primarily for batch inference and prompt-processing speed.

30 records · v1.0 · June 2026 · CC BY 4.0


Inference tier guide

  • Tier 1 Data Center — enterprise / cloud only; not home-deployable
  • Tier 2 Prosumer — serious home AI server; runs 70B+ models quantised
  • Tier 3 Capable — 13B–34B models comfortably; strong home use
  • Tier 4 Mid-Range — 7B–13B models; reasonable home use
  • Tier 5 Entry — 7B limit; VRAM or bandwidth-constrained

GPU and accelerator specifications

GPU / Accelerator · Architecture · Memory | Manufacturer | VRAM GB | VRAM Type | Bandwidth GB/s | FP16 TFLOPS | INT8 TOPS | TDP W | Tier
NVIDIA H100 SXM 80 GB · Hopper (GH100) · HBM3 | NVIDIA | 80 | HBM3 | 3,350 | 1,979 | 3,958 | 700 | 1
  Frontier inference GPU. FP16 and INT8 are the tensor-core figures from the official NVIDIA datasheet, which quotes them with sparsity; dense throughput is about half. Cloud/enterprise only. Credit: NVIDIA Corporation, nvidia.com/en-us/data-center/h100/
NVIDIA A100 SXM4 80 GB · Ampere DC (GA100) · HBM2e | NVIDIA | 80 | HBM2e | 2,000 | 312 | 624 | 400 | 1
  FP16 tensor-core dense, NVIDIA A100 datasheet. Available used on the secondary market. Credit: NVIDIA Corporation, nvidia.com/en-us/data-center/a100/
NVIDIA L40S 48 GB · Ada Lovelace (AD102) · GDDR6 | NVIDIA | 48 | GDDR6 | 864 | ~183 ⓘ | ~366 ⓘ | 350 | 1
  48 GB fits a 70B model at Q4 quantisation (about 35 GB of weights); 70B at FP16 needs about 140 GB. FP16/INT8 = dense tensor estimates; NVIDIA datasheets list the sparse figure (~2× dense). Credit: NVIDIA Corporation, nvidia.com/en-us/data-center/l40s/
NVIDIA L4 24 GB · Ada Lovelace (AD104) · GDDR6 | NVIDIA | 24 | GDDR6 | 300 | ~121 ⓘ | 242 | 72 | 2
  Efficiency standout: 72 W TDP, low-profile PCIe, 24 GB. INT8 = 242 TOPS confirmed from NVIDIA. FP16 derived as INT8/2. Credit: NVIDIA Corporation, nvidia.com/en-us/data-center/l4/
NVIDIA RTX A6000 (Ampere) 48 GB · Ampere (GA102) · GDDR6 | NVIDIA | 48 | GDDR6 | 768 | 155 | ~310 ⓘ | 300 | 2
  FP16 = 154.83 TFLOPS tensor-core dense, NVIDIA official datasheet. 48 GB + NVLink (96 GB dual-card). Available used. Credit: NVIDIA Corporation, nvidia.com/…/rtx-a6000/
NVIDIA GeForce RTX 5090 · Blackwell (GB202) · GDDR7 | NVIDIA | 32 | GDDR7 | 1,792 | ~210 ⓘ | n/a | 575 | 2
  Highest-bandwidth consumer GPU (1,792 GB/s, GDDR7). Launched Jan 2025. FP16 = shader estimate. NVIDIA does not publish GeForce tensor TFLOPS. Credit: NVIDIA Corporation, nvidia.com/…/rtx-5090/
NVIDIA GeForce RTX 4090 · Ada Lovelace (AD102) · GDDR6X | NVIDIA | 24 | GDDR6X | 1,008 | 165.2 | 330.3 | 450 | 2
  FP16 165.2 TFLOPS and INT8 330.3 TOPS = published by NVIDIA (shader FP16 = 2×82.6 FP32). Gold-standard home inference. Credit: NVIDIA Corporation, nvidia.com/…/rtx-4090/
Apple M4 Max (40-core GPU, up to 128 GB) · Apple Silicon M4 · Unified | Apple | 128 ‡ | Unified | 546 | ~18.4 ⓘ | 38 ‡ | n/a ‡ | 2
  ‡ VRAM = unified memory (no GDDR bus). INT8 TOPS = 16-core Neural Engine (separate from GPU). TDP not published per chip. FP16 = GPU shader third-party estimate. 128 GB fits a 70B model at 8-bit quantisation (about 70 GB of weights); 70B at FP16 (about 140 GB) exceeds it. Credit: Apple Inc., apple.com newsroom
NVIDIA GeForce RTX 3090 · Ampere (GA102) · GDDR6X | NVIDIA | 24 | GDDR6X | 936 | ~71 ⓘ | ~142 ⓘ | 350 | 3
  Best-value used card for 24 GB. FP16 confidence: moderate (sources diverge between 71 and 142; we use 71.16 = 2×35.58 FP32, the lower, conservative value). Credit: NVIDIA Corporation.
NVIDIA GeForce RTX 5080 · Blackwell (GB203) · GDDR7 | NVIDIA | 16 | GDDR7 | 960 | 112.6 | n/a | 360 | 3
  FP16 112.6 TFLOPS = consistently cited by multiple sources; Blackwell launched Jan 2025. Credit: NVIDIA Corporation, nvidia.com/…/rtx-5080/
NVIDIA GeForce RTX 4080 Super · Ada Lovelace (AD103) · GDDR6X | NVIDIA | 16 | GDDR6X | 736 | ~104 ⓘ | n/a | 320 | 3
  Credit: NVIDIA Corporation, nvidia.com/…/rtx-4080-family/
NVIDIA GeForce RTX 4080 · Ada Lovelace (AD103) · GDDR6X | NVIDIA | 16 | GDDR6X | 717 | ~97.5 ⓘ | n/a | 320 | 3
  FP32 49 TFLOPS confirmed by NVIDIA (“49 Shader-TFLOPs”). FP16 = 2×. Credit: NVIDIA Corporation.
AMD Radeon RX 7900 XTX · RDNA 3 (Navi 31) · GDDR6 | AMD | 24 | GDDR6 | 960 | 123 | 123 | 355 | 3
  FP16 Matrix (AI Accelerator) = 123 TFLOPS and INT8 Matrix = 123 TOPS per AMD official product page (RDNA 3 delivers the same throughput for FP16 and INT8 matrix). Inference via ROCm (Linux) or Vulkan/DirectML. Credit: AMD, amd.com RX 7900 XTX
Apple M3 Max (40-core GPU, up to 128 GB) · Apple Silicon M3 · Unified | Apple | 128 ‡ | Unified | 400 | ~16.4 ⓘ | 18 ‡ | n/a ‡ | 3
  ‡ Unified memory / Neural Engine: same caveats as M4 Max. 128 GB enables 70B Q4 locally. Credit: Apple Inc., apple.com M3 newsroom
NVIDIA GeForce RTX 4070 Ti Super · Ada Lovelace (AD103) · GDDR6X | NVIDIA | 16 | GDDR6X | 672 | ~88 ⓘ | n/a | 285 | 3
  Credit: NVIDIA Corporation, nvidia.com/…/rtx-4070-ti-super/
AMD Radeon RX 7900 XT · RDNA 3 (Navi 31) · GDDR6 | AMD | 20 | GDDR6 | 800 | 103 | 103 | 315 | 3
  FP16 Matrix = 103 TFLOPS, INT8 Matrix = 103 TOPS per WareDB (sourced from AMD official). 20 GB sweet spot. Credit: AMD, amd.com RX 7900 XT
Apple M4 Pro (20-core GPU, up to 64 GB) · Apple Silicon M4 · Unified | Apple | 64 ‡ | Unified | 273 | ~9.2 ⓘ | 38 ‡ | n/a ‡ | 3
  ‡ Same Apple caveats. FP16 = LOW confidence estimate (half of M4 Max). 64 GB unified memory. Credit: Apple Inc., Apple tech specs MBP M4 Pro
NVIDIA GeForce RTX 4070 Super · Ada Lovelace (AD104) · GDDR6X | NVIDIA | 12 | GDDR6X | 504 | ~71 ⓘ | n/a | 220 | 4
  Credit: NVIDIA Corporation, nvidia.com/…/rtx-4070-super/
AMD Radeon RX 7800 XT · RDNA 3 (Navi 32) · GDDR6 | AMD | 16 | GDDR6 | 576 | 74.6 | 74.6 | 263 | 4
  FP16 Matrix = 74.6 TFLOPS and INT8 Matrix = 74.6 TOPS per AMD official product page. 16 GB mid-range. Credit: AMD, amd.com RX 7800 XT
Intel Arc A770 16 GB · Xe-HPG Alchemist (ACM-G10) · GDDR6 | Intel | 16 | GDDR6 | 560 | 39.4 | n/a | 225 | 4
  FP16 39.4 TFLOPS = published on the Intel official product page. 16 GB GDDR6, XMX AI acceleration. INT8 not published for consumer Arc. Credit: Intel Corporation, intel.com Arc A770 specs
AMD Radeon RX 7700 XT · RDNA 3 (Navi 32) · GDDR6 | AMD | 12 | GDDR6 | 432 | 70.3 | 70.3 | 245 | 4
  FP16 Matrix = 70.3 TFLOPS per AMD official spec page (108 AI Accelerators). Credit: AMD, amd.com RX 7700 XT
NVIDIA GeForce RTX 4070 · Ada Lovelace (AD104) · GDDR6X | NVIDIA | 12 | GDDR6X | 504 | ~58.5 ⓘ | n/a | 200 | 4
  Credit: NVIDIA Corporation, nvidia.com/…/rtx-4070/
NVIDIA GeForce RTX 4060 Ti 16 GB · Ada Lovelace (AD106) · GDDR6 | NVIDIA | 16 | GDDR6 | 288 | ~44 ⓘ | n/a | 165 | 4
  16 GB is the main advantage; the 288 GB/s bandwidth is the bottleneck (about 3.5× lower than the RTX 4090's 1,008 GB/s, so token generation is correspondingly slower). Credit: NVIDIA Corporation.
NVIDIA GeForce RTX 3080 12 GB · Ampere (GA102) · GDDR6X | NVIDIA | 12 | GDDR6X | 912 | ~61 ⓘ | n/a | 350 | 4
  912 GB/s bandwidth makes this a fast token generator despite “only” 12 GB. Good used value. Credit: NVIDIA Corporation.
AMD Radeon RX 6900 XT · RDNA 2 (Navi 21) · GDDR6 | AMD | 16 | GDDR6 | 512 | ~46 ⓘ | n/a | 300 | 4
  RDNA 2: no dedicated AI Accelerators. FP16 = shader estimate. 16 GB at good used prices. ROCm support for older gen. Credit: AMD, amd.com RX 6900 XT
Intel Arc B580 12 GB · Xe2-HPG Battlemage (BMG-G21) · GDDR6 | Intel | 12 | GDDR6 | 456 | 27.3 | n/a | 190 | 4
  FP16 27.34 TFLOPS = Intel official product page. $249 launch (Dec 2024). 160 XMX engines for matrix acceleration. INT8 not published. Credit: Intel Corporation, intel.com Arc B580 specs
NVIDIA GeForce RTX 4060 Ti 8 GB · Ada Lovelace (AD106) · GDDR6 | NVIDIA | 8 | GDDR6 | 288 | ~44 ⓘ | n/a | 160 | 5
  Same compute as 16 GB variant. 8 GB VRAM is the binding AI constraint. Prefer 16 GB for AI work. Credit: NVIDIA Corporation.
NVIDIA GeForce RTX 4060 · Ada Lovelace (AD107) · GDDR6 | NVIDIA | 8 | GDDR6 | 272 | ~30 ⓘ | n/a | 115 | 5
  115 W TDP standout. 8 GB limits to 7B Q4. Good entry point. Credit: NVIDIA Corporation, nvidia.com/…/rtx-4060/
AMD Radeon RX 7600 · RDNA 3 (Navi 33) · GDDR6 | AMD | 8 | GDDR6 | 288 | ~42.8 ⓘ | ~42.8 ⓘ | 165 | 5
  8 GB RDNA 3 entry. 64 AI Accelerators. FP16/INT8 sourced from RDNA 3 architecture data via gpupoet (not directly from the amd.com product page; moderate confidence). Credit: AMD, amd.com RX 7600
NVIDIA GeForce RTX 3070 · Ampere (GA104) · GDDR6 | NVIDIA | 8 | GDDR6 | 448 | ~40.6 ⓘ | n/a | 220 | 5
  Widely available used at low prices. 8 GB limits to 7B Q4. Better bandwidth than RTX 4060 8 GB despite same VRAM class. Credit: NVIDIA Corporation.

ⓘ = estimated value (not directly confirmed from official spec page). ‡ = Apple Silicon caveat applies (unified memory / Neural Engine / TDP not published per chip). See per-row notes and the Methodology page. For machine-readable access, use the REST API.
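When working with the figures programmatically, the markers described above need to be separated from the numbers. The following is a minimal sketch, assuming cells arrive as the display strings shown in the table (for example "~183 ⓘ" or "38 ‡"); the `Cell` type and `parse_cell` function are hypothetical names, and the actual CSV/JSON export may already store these flags in separate fields.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: Optional[float]  # None when the cell holds no number
    estimated: bool         # True when marked with "~" or the info symbol
    apple_caveat: bool      # True when marked with the double dagger


def parse_cell(text: str) -> Cell:
    """Split a display cell into its number and its caveat markers."""
    estimated = "~" in text or "ⓘ" in text
    apple_caveat = "‡" in text
    # Digits with optional thousands separators and an optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    value = float(match.group().replace(",", "")) if match else None
    return Cell(value, estimated, apple_caveat)


print(parse_cell("~183 ⓘ"))
print(parse_cell("1,979"))
print(parse_cell("38 ‡"))
```

Keeping the flags alongside the value makes it easy to exclude estimated figures when only manufacturer-confirmed numbers are wanted.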

Frequently asked questions

Which GPU is best for running local LLMs at home?
For most home users, the NVIDIA GeForce RTX 4090 (24 GB, 1,008 GB/s) is the best single-card choice: it has the VRAM to run 34B models at Q4 quantisation (about 17 GB of weights) and the bandwidth to generate tokens quickly. If budget is the constraint, the RTX 3090 offers the same 24 GB and nearly the same bandwidth (936 GB/s) at a much lower used price. Apple Silicon (M4 Max with 128 GB unified memory) can hold a 70B model at 8-bit quantisation (about 70 GB of weights) in a single machine, which no consumer discrete GPU in this table can do; full FP16 (about 140 GB) still does not fit.
Why does memory bandwidth matter more than compute TFLOPS for local AI?
Modern LLM inference in single-stream mode is memory-bandwidth-bound, not compute-bound. The GPU must stream all of the model's weights from VRAM for every generated token. A card with more GB/s will generate tokens faster even if its FP16 number looks lower. Compute TFLOPS matter more for long-context prompt processing (prefill) and batch inference, where the arithmetic intensity rises.
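The bandwidth argument can be turned into a rough ceiling on generation speed: if every token requires reading all weights once, tokens per second cannot exceed bandwidth divided by weight size. The sketch below assumes exactly that and nothing else; it ignores KV-cache reads, kernel overhead, and real-world bandwidth efficiency, so measured speeds will be lower. The function name is illustrative; the bandwidth figures come from the table, and the 0.5 bytes/parameter figure for Q4 is the approximation used elsewhere on this page.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billions: float,
                          bytes_per_param: float) -> float:
    """Upper bound on single-stream tokens/s when weight streaming is the bottleneck."""
    weights_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weights_gb


# 7B model at Q4 (~0.5 bytes/parameter, about 3.5 GB of weights)
rtx_4090 = max_tokens_per_second(1008, 7, 0.5)
rtx_4060_ti = max_tokens_per_second(288, 7, 0.5)

print(round(rtx_4090, 1))                # 288.0
print(round(rtx_4060_ti, 1))             # 82.3
print(round(rtx_4090 / rtx_4060_ti, 1))  # 3.5
```

The ratio between two cards depends only on their bandwidth, which is why the bound is more useful for comparing cards than for predicting absolute speed.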
Can I use an AMD GPU for local AI?
Yes. AMD RDNA 3 GPUs work with ROCm on Linux for frameworks like PyTorch and llama.cpp (ROCm/HIP backend). On Windows, Vulkan and DirectML backends are available via llama.cpp and Ollama. The RX 7900 XTX (24 GB) and RX 7900 XT (20 GB) are competitive with NVIDIA options for inference throughput, and their AI Accelerators deliver the same matrix throughput for FP16 and INT8. ROCm tooling is improving but is not yet as seamless as NVIDIA’s CUDA.
What about Intel Arc GPUs?
Intel Arc GPUs (A770 16 GB, B580 12 GB) are a viable budget option, especially for the 16 GB A770. They use Intel’s XMX matrix engines for FP16/INT8 acceleration and are supported via IPEX-LLM, OpenVINO, and llama.cpp’s SYCL backend. CUDA is not available. Ecosystem maturity is lower than NVIDIA or AMD ROCm.
What does “INT8 TOPS” mean and why is it missing for many cards?
INT8 TOPS (tera-operations per second at 8-bit integer precision) measures how fast a GPU can run quantised model inference. NVIDIA does not publish INT8 TOPS for GeForce consumer cards (it appears only on pro/data-center datasheets). Where shown for consumer NVIDIA cards, the figure is estimated. AMD does publish AI Accelerator INT8 TOPS on its product pages for RDNA 3 — notably, RDNA 3 delivers the same throughput for FP16 and INT8 matrix operations.
What is unified memory (Apple) and how does it compare to GPU VRAM?
Apple Silicon uses a unified memory architecture where the same physical memory pool is shared between CPU, GPU, and Neural Engine with no bus-transfer overhead. “VRAM GB” for Apple entries in this table reflects the maximum unified-memory configuration, not a discrete VRAM pool. This means an M4 Max with 128 GB can devote most of that pool to model weights if needed, something no consumer discrete GPU offers. The trade-off is lower memory bandwidth (546 GB/s for M4 Max vs. 1,008 GB/s for RTX 4090), so token generation on the same model runs at roughly half the speed.
What is the smallest GPU that can run a 13B model?
A 13B parameter model in FP16 requires approximately 26 GB of VRAM for the weights alone (13B × 2 bytes). At Q4 quantisation (~0.5 bytes/parameter), the weights shrink to about 6.5 GB. So: a card with 8 GB VRAM (RTX 4060, RX 7600) can run a 13B model at Q4 quantisation, with little room left for context, but not at FP16. A 16 GB card (RTX 4080, RX 7800 XT) handles 13B at 8-bit quantisation (about 13 GB). Even a 24 GB card (RTX 4090, RX 7900 XTX) falls just short of 13B at full FP16 and needs quantisation or partial CPU offload.
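The sizing arithmetic above generalises to any model and precision. This is a minimal sketch, assuming weights dominate memory use: the 2 and 0.5 bytes/parameter figures are the ones quoted above, while the 1 byte/parameter value for 8-bit and the 1 GB allowance for KV cache and runtime buffers are placeholder assumptions that vary with context length and backend. All names are illustrative.

```python
# Approximate bytes per parameter. FP16 and Q4 match the figures in the text;
# Q8 = 1 byte is an assumed round number.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the model weights in GB."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str,
         vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM without offload."""
    return weights_gb(params_billions, precision) + overhead_gb <= vram_gb


print(weights_gb(13, "FP16"))  # 26.0
print(weights_gb(13, "Q4"))    # 6.5
print(fits(13, "Q4", 8))       # True
print(fits(13, "FP16", 8))     # False
```

Raising `overhead_gb` is a simple way to model longer context windows, since the KV cache grows with the number of tokens held in memory.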
How often is this dataset updated?
The dataset was last verified in June 2026 against official manufacturer spec pages. GPU specifications are stable once a card is released, but new models are released regularly. We aim to add new entries and correct any discrepancies as they are identified. Check last_verified in the CSV/JSON and verify key specifications at the manufacturer’s source before purchasing.
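A staleness check on `last_verified` could look like the sketch below. Only the field name `last_verified` comes from this page; the ISO `YYYY-MM-DD` date format, the `name` key, the list-of-records shape, and the sample rows are all assumptions made for illustration, so adjust them to the actual CSV/JSON schema.

```python
from datetime import date


def stale_records(records, today, max_age_days=365):
    """Return the names of records verified more than max_age_days before today."""
    stale = []
    for record in records:
        # Assumes an ISO date string such as "2026-06-01".
        verified = date.fromisoformat(record["last_verified"])
        if (today - verified).days > max_age_days:
            stale.append(record["name"])
    return stale


# Hypothetical rows, not taken from the dataset.
sample = [
    {"name": "Card A", "last_verified": "2026-06-01"},
    {"name": "Card B", "last_verified": "2024-12-15"},
]

print(stale_records(sample, today=date(2026, 9, 1)))  # ['Card B']
```

Any record the check flags is a prompt to re-read the manufacturer's spec page rather than evidence that the figure is wrong.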

Cite this dataset

This dataset is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence. You are free to share, adapt, and use this data for any purpose, including commercially, as long as you give appropriate credit.

APA
D-Central Technologies. (2026). AI & Local-Inference GPU Database (v1.0) [Dataset]. https://d-central.tech/data/ai-gpu-database/. CC BY 4.0.

Chicago
D-Central Technologies. “AI & Local-Inference GPU Database.” Version 1.0. Dataset. 2026. https://d-central.tech/data/ai-gpu-database/. CC BY 4.0.

BibTeX
@misc{dcentral2026gpudb,
  author = {{D-Central Technologies}},
  title = {AI \& Local-Inference {GPU} Database},
  year = {2026},
  version = {1.0},
  howpublished = {\url{https://d-central.tech/data/ai-gpu-database/}},
  note = {CC BY 4.0}
}

Machine-readable downloads: CSV · JSON · REST API

Specifications sourced from NVIDIA Corporation, AMD (Advanced Micro Devices), Apple Inc., and Intel Corporation official product pages and datasheets; cross-referenced with TechPowerUp GPU Database and WareDB. As of June 2026 — verify at source before purchasing decisions. Not financial or purchasing advice.