AI Inference Accelerators Beyond GPUs: Can That NPU or Edge TPU Actually Run a Local LLM?
Every AI PC and edge board advertises a TOPS number, but most cannot run a real language model — because local LLM inference is bound by memory capacity and bandwidth, not raw TOPS. This reference gives the honest verdict for 13 non-GPU accelerators (AMD Ryzen AI, Snapdragon X, Jetson, Tenstorrent, Hailo, Coral, Groq): compute, the all-important memory model, runtimes and power.
Quick answer
Every "AI PC" and edge board now advertises a TOPS number, and most of them cannot run a real language model. The reason is simple and constantly missed: local LLM inference is bound by memory CAPACITY and BANDWIDTH, not raw TOPS. A 4-TOPS Coral TPU with 8 MB of SRAM and a vision-CNN heritage will never hold an LLM; a unified-memory AMD or Apple SoC with 128 GB will run a 70B model on one box. This reference gives the honest verdict for 13 non-GPU accelerators (8 are a credible local-LLM path) — compute, the all-important memory model, runtimes and power. GPUs and Apple Silicon live in our sibling AI-GPU database; here we cover the NPUs, edge TPUs, RISC-V AI cards and LPUs everything else ignores.
Read the memory column, not the TOPS: unified-memory SoCs (AMD Ryzen AI Max+, Jetson AGX Orin) and DRAM-backed open cards (Tenstorrent) genuinely run local LLMs; thin-SRAM edge parts (Coral Edge TPU, Hailo-8) are vision/CNN devices that cannot, no matter the marketing; and Groq's LPU is rack-scale cloud silicon, not local hardware (one chip holds 230 MB — you'd need ~576 to serve a 70B). NPU TOPS, SDKs and prices move every quarter (last verified 2026-06) — re-verify at the manufacturer before buying. The open-stack Tenstorrent cards are the option that advances local-AI sovereignty one layer further. Free CSV/JSON under CC BY 4.0; this is the "can it run a local LLM?" companion to our GPU database.
Download CSV Download JSON REST API →
| Accelerator | Can it run a local LLM? | Memory model | Runtimes |
|---|---|---|---|
| AMD Ryzen AI Max+ 395 (Strix Halo) AMD · unified-memory APU/SoC 16 Zen 5 cores + 40 RDNA 3.5 iGPU CUs + XDNA 2 NPU ~50 TOPS (INT8) | Excellent Excellent — runs Llama-70B Q8 on a single device via the iGPU + unified memory pool. NOTE: the 50-TOPS NPU is largely unused for general LLM inference as of mid-2026; the iGPU is the actual LLM path (NPU used only for specific accelerated paths via Lemonade SDK). Standout consumer single-box 70B-class LLM host; shipping in mini-PCs (GMKtec EVO-X2, Beelink GTR9 Pro, AMD Ryzen AI Halo dev platform). | Up to 128 GB LPDDR5X-8000 unified; up to 96 GB assignable as VRAM (AMD Variable Graphics Memory) | llama.cpp (Vulkan/ROCm), LM Studio, Ollama, AMD Lemonade SDK; OpenAI-compatible via Ollama/LM Studio server Power: Configurable ~45-120 W (mini-PC/laptop) |
| AMD Ryzen AI 300 (Strix Point) XDNA 2 NPU AMD · NPU (in x86 laptop SoC) XDNA 2 NPU ~50 TOPS (INT8) + Zen 5 CPU + RDNA 3.5 iGPU | Emerging Emerging — small models (Llama 3.1 8B, Phi 3.5 Mini) via Ryzen AI Software / Lemonade on the NPU; iGPU+RAM is the practical LLM path. Mainstream laptop tier, narrower memory than Strix Halo. Distinguish from Max+ 395: same NPU class, far less memory bandwidth/capacity — much weaker for large LLMs. | Shared system LPDDR5X (dual-channel, typically 16-32 GB) — NOT the 256-bit wide pool of the Max+/Halo part | AMD Ryzen AI Software, Lemonade SDK, ONNX Runtime; llama.cpp on iGPU Power: ~15-54 W laptop envelope |
| NVIDIA Jetson AGX Orin 64 GB NVIDIA · edge SoC (Ampere GPU + ARM CPU + dual DLA) 2048-core Ampere GPU + 64 Tensor cores + 12-core ARM; up to 275 TOPS (sparse INT8) / 170 dense INT8 TOPS; 5.3 FP32 TFLOPs | Strong Strong — runs 13B unquantized and larger quantized models via CUDA; large unified pool is the differentiator over smaller Jetsons. Credit-card-sized edge module; popular self-hosted always-on LLM/robotics node. | 64 GB 256-bit LPDDR5 unified, 204.8 GB/s | llama.cpp, Ollama, NVIDIA TensorRT-LLM, MLC-LLM (full CUDA stack) Power: 15 W / 30 W / 50 W presets, up to 60 W MAXN |
| NVIDIA Jetson Orin Nano Super 8 GB NVIDIA · edge SoC (Ampere GPU + ARM CPU) Up to 67 TOPS (INT8) | Limited Limited — small 3B-8B quantized models only; 8 GB ceiling. Budget entry to the Jetson/CUDA LLM ecosystem. Cheapest CUDA-capable on-device LLM box; good for tinkering, not large models. | 8 GB LPDDR5 unified (also 4 GB variant) | llama.cpp, Ollama, MLC-LLM, TensorRT-LLM Power: 7-25 W |
| Qualcomm Snapdragon X Elite (Hexagon NPU) Qualcomm · NPU (in ARM laptop SoC) Hexagon NPU 45 TOPS (INT8) + 12-core Oryon ARMv9 CPU + Adreno GPU | Good Good / Emerging — runs 8B-13B-class models on-device; NPU optimized for INT4/INT8 (not FP16/BF16). Software maturity is the gating factor. Windows-on-ARM AI PCs; community testing confirms usable power-efficient local LLMs on the NPU. | Unified LPDDR5X up to 64 GB (commonly 16 GB at 8533 MT/s) | llama.cpp, Ollama (CPU/NPU paths maturing); Qualcomm AI Engine / QNN Power: Very efficient laptop envelope (NPU is fraction of iGPU power) |
| Intel Core Ultra Series 2 (Lunar Lake) NPU Intel · NPU (NPU 4, in x86 SoC) NPU up to 48 TOPS (INT8) + P/E cores + Xe2 (Arc) iGPU; combined platform ~120 TOPS | Emerging Emerging / Moderate — 7B-8B models via OpenVINO GenAI on NPU (NF4 supported); ~8 tok/s combining CPU+NPU+GPU. NPU draws ~2-3 W vs 15-25 W iGPU → best for efficiency, not peak throughput. Prompts >1024 tokens with >7B models may need >16 GB RAM. Strongest case is battery-efficient assistant loops, not large-model throughput. | On-package LPDDR5X (16 GB or 32 GB on Lunar Lake — capacity is the constraint) | Intel OpenVINO / OpenVINO GenAI (NPU plugin), WindowsML, DirectML, ONNX RT, WebNN Power: Laptop envelope; NPU ~2-3 W |
| Tenstorrent Blackhole p150a Tenstorrent · AI accelerator PCIe card (RISC-V Tensix) 140 Tensix cores + 16 'big RISC-V' cores; 210 MB SRAM | Good Good — purpose-built local LLM inference card; runs Llama/Qwen/Mistral/Mixtral/Falcon via Tenstorrent's open-source vLLM fork. QSFP-DD ports allow multi-card memory pooling. ~$1,399. Fully open-source software stack — aligns with the decentralization/sovereignty narrative; a credible non-NVIDIA local-LLM card. | 32 GB GDDR6 @ 512 GB/s (DRAM-backed) | Tenstorrent TT-Metalium / tt-vLLM (open source, OpenAI-compatible server); access 'down to the metal' Power: Up to 300 W, active-cooled |
| Tenstorrent Wormhole n300d Tenstorrent · AI accelerator PCIe card (RISC-V Tensix, dual ASIC) 2x Wormhole ASICs, 128 Tensix cores; 192 MB SRAM | Good Good — prior-gen open-stack local LLM card; same TT software ecosystem as Blackhole. ~$1,449. Predecessor to Blackhole (credit Tenstorrent's iterative open-hardware lineage). | 24 GB GDDR6 @ 576 GB/s (DRAM-backed) | TT-Metalium / tt-vLLM (open source, OpenAI-compatible) Power: Up to 300 W |
| Hailo-10H Hailo · edge generative-AI accelerator (M.2) 40 TOPS (INT4) / 20 TOPS (INT8), 2nd-gen neural core | Limited Limited / Emerging — first Hailo part that CAN run small LLMs/VLMs/diffusion at the edge (the direct-DDR interface lifts the on-die-SRAM cap); capacity-limited to small models. Tested on Raspberry Pi AI HAT+ 2. Genuinely runs gen-AI at the edge but is a small-model, vendor-SDK device — not a general OpenAI-API LLM host. | Direct DDR interface to on-module LPDDR4/4X, 4 GB or 8 GB | Hailo Dataflow Compiler / HailoRT SDK (vendor stack) Power: ~2.5 W typical |
| Hailo-8 Hailo · edge vision accelerator (M.2/PCIe) Up to 26 TOPS (INT8) | Not for LLMs Not suitable (vision/CNN only) — no DRAM path; cannot hold LLM weights. Designed for vision networks. Included to correct a common misconception. Market-leading edge vision accelerator; NOT an LLM device. The Hailo-10H (above) is the gen-AI successor. | All weights on-die SRAM — NO external memory interface (hard cap on model size) | HailoRT (vision pipelines: detection/segmentation/classification) Power: Low single-digit W |
| Google Coral Edge TPU (USB / M.2 / Dev Board) Google · edge TPU (vision coprocessor) 4 TOPS (INT8), 2 TOPS/W | Not for LLMs Not suitable (vision/CNN only) — built for the convolutional vision era (e.g. MobileNet v2 ~400 fps); never designed for language models, no memory to hold LLM weights. Frequently mis-asked-about for LLMs; the answer is no. (Google's newer 'Coralboard' with a transformer-capable NPU is a separate, distinct product.) | ~8 MB on-chip SRAM; TensorFlow Lite INT8 models only | TensorFlow Lite (Edge TPU compiler) — vision models Power: ~2 W |
| Groq LPU (GroqCard) Groq · datacenter inference ASIC (Language Processing Unit) Deterministic dataflow architecture | Not local Not local (cloud/datacenter only) — accessed via GroqCloud token API; rack-scale only. Listed to clarify it is not home/local hardware. ~$20k/card and useless in isolation; belongs to the cloud-inference economics story, not local hardware. | 230 MB on-chip SRAM @ ~80 TB/s, NO DRAM — a single chip cannot hold even a small model; ~576 LPUs needed to serve Llama-2-70B | GroqCloud API (OpenAI-compatible endpoint) — service, not local device Power: Datacenter card, rack-scale |
| Apple Silicon M-series (M3/M4 Max/Pro) — see ai-gpu-database Apple · unified-memory SoC (POINTER ROW — not duplicated here) Up to 40-core GPU + 16-core Apple Neural Engine (ANE) | Excellent Excellent — but the LLM path is the GPU via Metal/MLX/llama.cpp, NOT the Apple Neural Engine (ANE is used for Core ML vision/system tasks, not general LLM decode). Full specs live in the sibling GPU dataset to avoid duplication. CROSS-LINK ONLY — full rows (apple-m4-max-128gb, apple-m3-max-128gb, apple-m4-pro-64gb) already in /data/ai-gpu-database/. Present here purely to disambiguate ANE vs GPU and keep the two datasets explicitly connected. | Up to 128 GB unified memory — the reference standard for local LLM on a SoC | MLX, llama.cpp (Metal), Ollama, LM Studio Power: Laptop/desktop envelope |
Sources: manufacturer / first-party docs (full source URL + last_verified per row in the CSV/JSON). The "can it run a local LLM?" companion to the AI-GPU hardware database (GPUs + Apple Silicon) and the local AI runtimes. Part of the Sovereign AI stack.
Related products, repair, and setup paths
- how D-Central diagnoses ASIC repairs
- ASIC troubleshooting library
- ASIC manuals and repair guides
- replacement hashboards
- ASIC control boards
- ASIC power supplies
- S19 family replacement hashboard
- C52 replacement control board
- APW12 S19 power supply
- immersion cooling hub
- home immersion cooling guide
- ASIC miners for immersion planning
- ASIC cooling parts
- airflow shroud before immersion
- compare miner specs in the database
- ASIC repair support
Last reviewed June 21, 2026.
