AI Inference Accelerators Beyond GPUs: Can That NPU or Edge TPU Actually Run a Local LLM?


Quick answer

Every "AI PC" and edge board now advertises a TOPS number, and most of them cannot run a real language model. The reason is simple and constantly missed: local LLM inference is bound by memory CAPACITY and BANDWIDTH, not raw TOPS. A 4-TOPS Coral Edge TPU with ~8 MB of SRAM and a vision-CNN heritage will never hold an LLM; a unified-memory AMD or Apple SoC with 128 GB will run a 70B model on one box. This reference gives the honest verdict for 13 non-GPU accelerators (8 of them a credible local-LLM path), covering compute, the all-important memory model, runtimes and power. GPUs and Apple Silicon live in our sibling AI-GPU database; here we cover the NPUs, edge TPUs, RISC-V AI cards and LPUs that everything else ignores.
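
The capacity half of that argument can be sketched as back-of-the-envelope arithmetic. The snippet below is a minimal illustration, not a sizing tool: the function names and the 20% overhead allowance (for KV cache and runtime buffers) are assumptions made for this example, and real quantization formats store extra per-block scale data, so actual footprints run somewhat higher than the nominal bit width suggests.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in decimal GB.

    params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    """
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float,
         capacity_gb: float, overhead_fraction: float = 0.2) -> bool:
    """True if the weights plus an assumed overhead allowance fit in capacity_gb.

    overhead_fraction is a hypothetical margin for KV cache and buffers,
    not a figure taken from the dataset.
    """
    needed = weight_footprint_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= capacity_gb


if __name__ == "__main__":
    # 70B at a nominal 8 bits per weight against 96 GB of assignable VRAM.
    print(weight_footprint_gb(70, 8), fits(70, 8, 96))
    # The same model against ~8 MB (0.008 GB) of on-chip SRAM.
    print(fits(70, 8, 0.008))
```

TOPS never appears in the calculation: whether the weights fit at all is decided by the memory pool alone.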

Read the memory column, not the TOPS. Unified-memory SoCs (AMD Ryzen AI Max+, Jetson AGX Orin) and DRAM-backed open cards (Tenstorrent) genuinely run local LLMs. Thin-SRAM edge parts (Coral Edge TPU, Hailo-8) are vision/CNN devices that cannot, no matter the marketing. Groq's LPU is rack-scale cloud silicon, not local hardware (one chip holds 230 MB, so you'd need ~576 of them to serve a 70B model). NPU TOPS, SDKs and prices move every quarter (last verified 2026-06); re-verify with the manufacturer before buying. Because their software stack is open source, the Tenstorrent cards are the option that advances local-AI sovereignty one layer further. The data is free as CSV/JSON under CC BY 4.0; this is the "can it run a local LLM?" companion to our GPU database.
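
The Groq figures are easy to sanity-check. The sketch below only multiplies out the two numbers quoted above (230 MB per chip, ~576 chips); the helper names are made up for this example, and the weight precision and cache overhead of Groq's actual deployment are not stated here, so the comparison against 8-bit and 16-bit weights is illustrative only.

```python
SRAM_PER_LPU_MB = 230   # per-chip on-chip SRAM, as quoted in the text
LPUS_FOR_70B = 576      # approximate chip count quoted in the text


def aggregate_sram_gb(chips: int, sram_mb_per_chip: float = SRAM_PER_LPU_MB) -> float:
    """Total on-chip SRAM across `chips` devices, in decimal GB."""
    return chips * sram_mb_per_chip / 1000.0


def model_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in decimal GB."""
    return params_billion * bits_per_weight / 8.0


if __name__ == "__main__":
    print(f"one chip:  {aggregate_sram_gb(1):.2f} GB")
    print(f"576 chips: {aggregate_sram_gb(LPUS_FOR_70B):.2f} GB")
    # Assumed precisions for comparison only.
    print(f"70B @ 8-bit:  {model_weights_gb(70, 8):.0f} GB")
    print(f"70B @ 16-bit: {model_weights_gb(70, 16):.0f} GB")
```

One chip's 0.23 GB is smaller than even a 1B-parameter model at 8 bits (about 1 GB), which is why a single card is useless in isolation; only the pooled SRAM of hundreds of chips reaches the scale of a 70B model's weights.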


Accelerator (vendor · type; compute)	Can it run a local LLM?	Memory model	Runtimes and power
AMD Ryzen AI Max+ 395 (Strix Halo); AMD · unified-memory APU/SoC; 16 Zen 5 cores + 40 RDNA 3.5 iGPU CUs + XDNA 2 NPU at ~50 TOPS (INT8)	Excellent: runs Llama-70B Q8 on a single device via the iGPU and the unified memory pool. Note that the 50-TOPS NPU is largely unused for general LLM inference as of mid-2026; the iGPU is the actual LLM path, and the NPU is used only for specific accelerated paths via the Lemonade SDK. The standout consumer single-box host for 70B-class LLMs; shipping in mini-PCs (GMKtec EVO-X2, Beelink GTR9 Pro, AMD Ryzen AI Halo dev platform).	Up to 128 GB LPDDR5X-8000 unified; up to 96 GB assignable as VRAM (AMD Variable Graphics Memory)	llama.cpp (Vulkan/ROCm), LM Studio, Ollama, AMD Lemonade SDK; OpenAI-compatible via the Ollama or LM Studio server. Power: configurable ~45-120 W (mini-PC/laptop)
AMD Ryzen AI 300 (Strix Point) XDNA 2 NPU; AMD · NPU (in x86 laptop SoC); XDNA 2 NPU at ~50 TOPS (INT8) + Zen 5 CPU + RDNA 3.5 iGPU	Emerging: small models (Llama 3.1 8B, Phi 3.5 Mini) run on the NPU via Ryzen AI Software / Lemonade; iGPU + RAM is the practical LLM path. Mainstream laptop tier with narrower memory than Strix Halo. Distinguish it from the Max+ 395: same NPU class but far less memory bandwidth and capacity, so it is much weaker for large LLMs.	Shared system LPDDR5X (dual-channel, typically 16-32 GB), NOT the 256-bit-wide pool of the Max+/Halo part	AMD Ryzen AI Software, Lemonade SDK, ONNX Runtime; llama.cpp on the iGPU. Power: ~15-54 W laptop envelope
NVIDIA Jetson AGX Orin 64 GB; NVIDIA · edge SoC (Ampere GPU + ARM CPU + dual DLA); 2048-core Ampere GPU + 64 Tensor cores + 12-core ARM; up to 275 TOPS (sparse INT8) across GPU and DLAs, of which 170 sparse INT8 TOPS come from the GPU; 5.3 FP32 TFLOPS	Strong: runs 13B unquantized and larger quantized models via CUDA; the large unified pool is the differentiator over smaller Jetsons. Compact edge module; popular as a self-hosted, always-on LLM/robotics node.	64 GB 256-bit LPDDR5 unified, 204.8 GB/s	llama.cpp, Ollama, NVIDIA TensorRT-LLM, MLC-LLM (full CUDA stack). Power: 15 W / 30 W / 50 W presets, up to 60 W MAXN
NVIDIA Jetson Orin Nano Super 8 GB; NVIDIA · edge SoC (Ampere GPU + ARM CPU); up to 67 TOPS (INT8)	Limited: small 3B-8B quantized models only; 8 GB ceiling. Budget entry to the Jetson/CUDA LLM ecosystem and the cheapest CUDA-capable on-device LLM box; good for tinkering, not large models.	8 GB LPDDR5 unified (a 4 GB variant also exists)	llama.cpp, Ollama, MLC-LLM, TensorRT-LLM. Power: 7-25 W
Qualcomm Snapdragon X Elite (Hexagon NPU); Qualcomm · NPU (in ARM laptop SoC); Hexagon NPU at 45 TOPS (INT8) + 12-core Oryon ARMv9 CPU + Adreno GPU	Good / Emerging: runs 8B-13B-class models on-device; the NPU is optimized for INT4/INT8 (not FP16/BF16). Software maturity is the gating factor. Found in Windows-on-ARM AI PCs; community testing confirms usable, power-efficient local LLMs on the NPU.	Unified LPDDR5X up to 64 GB (commonly 16 GB at 8533 MT/s)	llama.cpp, Ollama (CPU/NPU paths maturing); Qualcomm AI Engine / QNN. Power: very efficient laptop envelope (the NPU draws a fraction of iGPU power)
Intel Core Ultra Series 2 (Lunar Lake) NPU; Intel · NPU (NPU 4, in x86 SoC); NPU up to 48 TOPS (INT8) + P/E cores + Xe2 (Arc) iGPU; combined platform ~120 TOPS	Emerging / Moderate: 7B-8B models via OpenVINO GenAI on the NPU (NF4 supported); ~8 tok/s when combining CPU + NPU + GPU. The NPU draws ~2-3 W versus 15-25 W for the iGPU, so it is best for efficiency, not peak throughput. Prompts >1024 tokens with >7B models may need >16 GB RAM. The strongest case is battery-efficient assistant loops, not large-model throughput.	On-package LPDDR5X (16 GB or 32 GB on Lunar Lake; capacity is the constraint)	Intel OpenVINO / OpenVINO GenAI (NPU plugin), WindowsML, DirectML, ONNX RT, WebNN. Power: laptop envelope; NPU ~2-3 W
Tenstorrent Blackhole p150a; Tenstorrent · AI accelerator PCIe card (RISC-V Tensix); 140 Tensix cores + 16 'big RISC-V' cores; 210 MB SRAM	Good: purpose-built local LLM inference card; runs Llama/Qwen/Mistral/Mixtral/Falcon via Tenstorrent's open-source vLLM fork. QSFP-DD ports allow multi-card memory pooling. Priced at ~$1,399. The fully open-source software stack aligns with the decentralization/sovereignty narrative, making it a credible non-NVIDIA local-LLM card.	32 GB GDDR6 @ 512 GB/s (DRAM-backed)	Tenstorrent TT-Metalium / tt-vLLM (open source, OpenAI-compatible server); access 'down to the metal'. Power: up to 300 W, active-cooled
Tenstorrent Wormhole n300d; Tenstorrent · AI accelerator PCIe card (RISC-V Tensix, dual ASIC); 2x Wormhole ASICs, 128 Tensix cores; 192 MB SRAM	Good: prior-generation open-stack local LLM card; same TT software ecosystem as Blackhole. Priced at ~$1,449. Predecessor to Blackhole in Tenstorrent's iterative open-hardware lineage.	24 GB GDDR6 @ 576 GB/s (DRAM-backed)	TT-Metalium / tt-vLLM (open source, OpenAI-compatible). Power: up to 300 W
Hailo-10H; Hailo · edge generative-AI accelerator (M.2); 40 TOPS (INT4) / 20 TOPS (INT8), 2nd-gen neural core	Limited / Emerging: the first Hailo part that CAN run small LLMs/VLMs/diffusion at the edge (the direct-DDR interface lifts the on-die-SRAM cap); capacity-limited to small models. Tested on the Raspberry Pi AI HAT+ 2. It genuinely runs gen-AI at the edge but is a small-model, vendor-SDK device, not a general OpenAI-API LLM host.	Direct DDR interface to on-module LPDDR4/4X, 4 GB or 8 GB	Hailo Dataflow Compiler / HailoRT SDK (vendor stack). Power: ~2.5 W typical
Hailo-8; Hailo · edge vision accelerator (M.2/PCIe); up to 26 TOPS (INT8)	Not for LLMs (vision/CNN only): no DRAM path, so it cannot hold LLM weights. Designed for vision networks and included here to correct a common misconception. A market-leading edge vision accelerator, NOT an LLM device. The Hailo-10H (above) is the gen-AI successor.	All weights in on-die SRAM; NO external memory interface (hard cap on model size)	HailoRT (vision pipelines: detection/segmentation/classification). Power: low single-digit W
Google Coral Edge TPU (USB / M.2 / Dev Board); Google · edge TPU (vision coprocessor); 4 TOPS (INT8), 2 TOPS/W	Not for LLMs (vision/CNN only): built for the convolutional vision era (e.g. MobileNet v2 at ~400 fps); never designed for language models, with no memory to hold LLM weights. Frequently asked about for LLMs; the answer is no. (Google's newer 'Coralboard' with a transformer-capable NPU is a separate, distinct product.)	~8 MB on-chip SRAM; TensorFlow Lite INT8 models only	TensorFlow Lite (Edge TPU compiler), vision models. Power: ~2 W
Groq LPU (GroqCard); Groq · datacenter inference ASIC (Language Processing Unit); deterministic dataflow architecture	Not local (cloud/datacenter only): accessed via the GroqCloud token API; rack-scale only. Listed to clarify that it is not home/local hardware. At ~$20k/card and useless in isolation, it belongs to the cloud-inference economics story, not local hardware.	230 MB on-chip SRAM @ ~80 TB/s, NO DRAM; a single chip cannot hold even a small model; ~576 LPUs needed to serve Llama-2-70B	GroqCloud API (OpenAI-compatible endpoint), a service rather than a local device. Power: datacenter card, rack-scale
Apple Silicon M-series (M3/M4 Max/Pro), see ai-gpu-database; Apple · unified-memory SoC (pointer row, not duplicated here); up to 40-core GPU + 16-core Apple Neural Engine (ANE)	Excellent, but the LLM path is the GPU via Metal/MLX/llama.cpp, NOT the Apple Neural Engine (the ANE is used for Core ML vision/system tasks, not general LLM decode). This row is a cross-link only: the full rows (apple-m4-max-128gb, apple-m3-max-128gb, apple-m4-pro-64gb) already live in /data/ai-gpu-database/. It is present here purely to disambiguate ANE vs GPU and to keep the two datasets explicitly connected.	Up to 128 GB unified memory; the reference standard for local LLMs on an SoC	MLX, llama.cpp (Metal), Ollama, LM Studio. Power: laptop/desktop envelope
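
The bandwidth figures in the memory column can be turned into a rough speed ceiling. The sketch below rests on two assumptions that are not part of the dataset: peak DRAM bandwidth is bus width times transfer rate, and single-stream decode of a dense model reads roughly all weights once per generated token, so tokens per second cannot exceed bandwidth divided by weight size. The function names are hypothetical, the 6400 MT/s rate for the Jetson is inferred from its listed 256-bit bus and 204.8 GB/s, and real throughput falls below this ceiling once KV-cache reads, compute limits and software overhead are counted.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Peak DRAM bandwidth in GB/s: bytes per transfer * transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


def decode_ceiling_tok_s(bandwidth_gbs: float, weights_gb: float) -> float:
    """Rule-of-thumb upper bound on single-stream decode speed (tokens/s).

    Assumes a dense model whose weights are all streamed once per token.
    """
    return bandwidth_gbs / weights_gb


if __name__ == "__main__":
    # 256-bit bus at an inferred 6400 MT/s (Jetson AGX Orin row).
    print(round(peak_bandwidth_gbs(256, 6400), 1))
    # 256-bit bus at 8000 MT/s (the LPDDR5X-8000 pool of the Strix Halo row).
    print(round(peak_bandwidth_gbs(256, 8000), 1))
    # Hypothetical 8B model at a nominal 4 bits (~4 GB) on a 512 GB/s card.
    print(round(decode_ceiling_tok_s(512, 4), 1))
    # 70B at a nominal 8 bits (~70 GB) on a 256 GB/s unified pool.
    print(round(decode_ceiling_tok_s(256, 70), 2))
```

Read this way, capacity decides whether a model loads at all, and bandwidth decides how fast it answers once it does.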

Sources: manufacturer and first-party documentation; each row in the CSV/JSON carries its full source URL and a last_verified date. This dataset is the "can it run a local LLM?" companion to the AI-GPU hardware database (GPUs and Apple Silicon) and the local AI runtimes, and is part of the Sovereign AI stack.


Last reviewed June 21, 2026.