Skip to content

Bitcoin accepted at checkout  |  Ships from Laval, QC, Canada  |  Expert support since 2016

AI Inference Accelerators Beyond GPUs: Can That NPU or Edge TPU Actually Run a Local LLM?

Every AI PC and edge board advertises a TOPS number, but most cannot run a real language model — because local LLM inference is bound by memory capacity and bandwidth, not raw TOPS. This reference gives the honest verdict for 13 non-GPU accelerators (AMD Ryzen AI, Snapdragon X, Jetson, Tenstorrent, Hailo, Coral, Groq): compute, the all-important memory model, runtimes and power.

Quick answer

Every "AI PC" and edge board now advertises a TOPS number, and most of them cannot run a real language model. The reason is simple and constantly missed: local LLM inference is bound by memory CAPACITY and BANDWIDTH, not raw TOPS. A 4-TOPS Coral TPU with 8 MB of SRAM and a vision-CNN heritage will never hold an LLM; a unified-memory AMD or Apple SoC with 128 GB will run a 70B model on one box. This reference gives the honest verdict for 13 non-GPU accelerators (8 are a credible local-LLM path) — compute, the all-important memory model, runtimes and power. GPUs and Apple Silicon live in our sibling AI-GPU database; here we cover the NPUs, edge TPUs, RISC-V AI cards and LPUs everything else ignores.

Read the memory column, not the TOPS: unified-memory SoCs (AMD Ryzen AI Max+, Jetson AGX Orin) and DRAM-backed open cards (Tenstorrent) genuinely run local LLMs; thin-SRAM edge parts (Coral Edge TPU, Hailo-8) are vision/CNN devices that cannot, no matter the marketing; and Groq's LPU is rack-scale cloud silicon, not local hardware (one chip holds 230 MB — you'd need ~576 to serve a 70B). NPU TOPS, SDKs and prices move every quarter (last verified 2026-06) — re-verify at the manufacturer before buying. The open-stack Tenstorrent cards are the option that advances local-AI sovereignty one layer further. Free CSV/JSON under CC BY 4.0; this is the "can it run a local LLM?" companion to our GPU database.

Download CSV Download JSON REST API →

AcceleratorCan it run a local LLM?Memory modelRuntimes
AMD Ryzen AI Max+ 395 (Strix Halo)
AMD · unified-memory APU/SoC
16 Zen 5 cores + 40 RDNA 3.5 iGPU CUs + XDNA 2 NPU ~50 TOPS (INT8)
Excellent
Excellent — runs Llama-70B Q8 on a single device via the iGPU + unified memory pool. NOTE: the 50-TOPS NPU is largely unused for general LLM inference as of mid-2026; the iGPU is the actual LLM path (NPU used only for specific accelerated paths via Lemonade SDK).
Standout consumer single-box 70B-class LLM host; shipping in mini-PCs (GMKtec EVO-X2, Beelink GTR9 Pro, AMD Ryzen AI Halo dev platform).
Up to 128 GB LPDDR5X-8000 unified; up to 96 GB assignable as VRAM (AMD Variable Graphics Memory)llama.cpp (Vulkan/ROCm), LM Studio, Ollama, AMD Lemonade SDK; OpenAI-compatible via Ollama/LM Studio server
Power: Configurable ~45-120 W (mini-PC/laptop)
AMD Ryzen AI 300 (Strix Point) XDNA 2 NPU
AMD · NPU (in x86 laptop SoC)
XDNA 2 NPU ~50 TOPS (INT8) + Zen 5 CPU + RDNA 3.5 iGPU
Emerging
Emerging — small models (Llama 3.1 8B, Phi 3.5 Mini) via Ryzen AI Software / Lemonade on the NPU; iGPU+RAM is the practical LLM path. Mainstream laptop tier, narrower memory than Strix Halo.
Distinguish from Max+ 395: same NPU class, far less memory bandwidth/capacity — much weaker for large LLMs.
Shared system LPDDR5X (dual-channel, typically 16-32 GB) — NOT the 256-bit wide pool of the Max+/Halo partAMD Ryzen AI Software, Lemonade SDK, ONNX Runtime; llama.cpp on iGPU
Power: ~15-54 W laptop envelope
NVIDIA Jetson AGX Orin 64 GB
NVIDIA · edge SoC (Ampere GPU + ARM CPU + dual DLA)
2048-core Ampere GPU + 64 Tensor cores + 12-core ARM; up to 275 TOPS (sparse INT8) / 170 dense INT8 TOPS; 5.3 FP32 TFLOPs
Strong
Strong — runs 13B unquantized and larger quantized models via CUDA; large unified pool is the differentiator over smaller Jetsons.
Credit-card-sized edge module; popular self-hosted always-on LLM/robotics node.
64 GB 256-bit LPDDR5 unified, 204.8 GB/sllama.cpp, Ollama, NVIDIA TensorRT-LLM, MLC-LLM (full CUDA stack)
Power: 15 W / 30 W / 50 W presets, up to 60 W MAXN
NVIDIA Jetson Orin Nano Super 8 GB
NVIDIA · edge SoC (Ampere GPU + ARM CPU)
Up to 67 TOPS (INT8)
Limited
Limited — small 3B-8B quantized models only; 8 GB ceiling. Budget entry to the Jetson/CUDA LLM ecosystem.
Cheapest CUDA-capable on-device LLM box; good for tinkering, not large models.
8 GB LPDDR5 unified (also 4 GB variant)llama.cpp, Ollama, MLC-LLM, TensorRT-LLM
Power: 7-25 W
Qualcomm Snapdragon X Elite (Hexagon NPU)
Qualcomm · NPU (in ARM laptop SoC)
Hexagon NPU 45 TOPS (INT8) + 12-core Oryon ARMv9 CPU + Adreno GPU
Good
Good / Emerging — runs 8B-13B-class models on-device; NPU optimized for INT4/INT8 (not FP16/BF16). Software maturity is the gating factor.
Windows-on-ARM AI PCs; community testing confirms usable power-efficient local LLMs on the NPU.
Unified LPDDR5X up to 64 GB (commonly 16 GB at 8533 MT/s)llama.cpp, Ollama (CPU/NPU paths maturing); Qualcomm AI Engine / QNN
Power: Very efficient laptop envelope (NPU is fraction of iGPU power)
Intel Core Ultra Series 2 (Lunar Lake) NPU
Intel · NPU (NPU 4, in x86 SoC)
NPU up to 48 TOPS (INT8) + P/E cores + Xe2 (Arc) iGPU; combined platform ~120 TOPS
Emerging
Emerging / Moderate — 7B-8B models via OpenVINO GenAI on NPU (NF4 supported); ~8 tok/s combining CPU+NPU+GPU. NPU draws ~2-3 W vs 15-25 W iGPU → best for efficiency, not peak throughput. Prompts >1024 tokens with >7B models may need >16 GB RAM.
Strongest case is battery-efficient assistant loops, not large-model throughput.
On-package LPDDR5X (16 GB or 32 GB on Lunar Lake — capacity is the constraint)Intel OpenVINO / OpenVINO GenAI (NPU plugin), WindowsML, DirectML, ONNX RT, WebNN
Power: Laptop envelope; NPU ~2-3 W
Tenstorrent Blackhole p150a
Tenstorrent · AI accelerator PCIe card (RISC-V Tensix)
140 Tensix cores + 16 'big RISC-V' cores; 210 MB SRAM
Good
Good — purpose-built local LLM inference card; runs Llama/Qwen/Mistral/Mixtral/Falcon via Tenstorrent's open-source vLLM fork. QSFP-DD ports allow multi-card memory pooling.
~$1,399. Fully open-source software stack — aligns with the decentralization/sovereignty narrative; a credible non-NVIDIA local-LLM card.
32 GB GDDR6 @ 512 GB/s (DRAM-backed)Tenstorrent TT-Metalium / tt-vLLM (open source, OpenAI-compatible server); access 'down to the metal'
Power: Up to 300 W, active-cooled
Tenstorrent Wormhole n300d
Tenstorrent · AI accelerator PCIe card (RISC-V Tensix, dual ASIC)
2x Wormhole ASICs, 128 Tensix cores; 192 MB SRAM
Good
Good — prior-gen open-stack local LLM card; same TT software ecosystem as Blackhole.
~$1,449. Predecessor to Blackhole (credit Tenstorrent's iterative open-hardware lineage).
24 GB GDDR6 @ 576 GB/s (DRAM-backed)TT-Metalium / tt-vLLM (open source, OpenAI-compatible)
Power: Up to 300 W
Hailo-10H
Hailo · edge generative-AI accelerator (M.2)
40 TOPS (INT4) / 20 TOPS (INT8), 2nd-gen neural core
Limited
Limited / Emerging — first Hailo part that CAN run small LLMs/VLMs/diffusion at the edge (the direct-DDR interface lifts the on-die-SRAM cap); capacity-limited to small models. Tested on Raspberry Pi AI HAT+ 2.
Genuinely runs gen-AI at the edge but is a small-model, vendor-SDK device — not a general OpenAI-API LLM host.
Direct DDR interface to on-module LPDDR4/4X, 4 GB or 8 GBHailo Dataflow Compiler / HailoRT SDK (vendor stack)
Power: ~2.5 W typical
Hailo-8
Hailo · edge vision accelerator (M.2/PCIe)
Up to 26 TOPS (INT8)
Not for LLMs
Not suitable (vision/CNN only) — no DRAM path; cannot hold LLM weights. Designed for vision networks. Included to correct a common misconception.
Market-leading edge vision accelerator; NOT an LLM device. The Hailo-10H (above) is the gen-AI successor.
All weights on-die SRAM — NO external memory interface (hard cap on model size)HailoRT (vision pipelines: detection/segmentation/classification)
Power: Low single-digit W
Google Coral Edge TPU (USB / M.2 / Dev Board)
Google · edge TPU (vision coprocessor)
4 TOPS (INT8), 2 TOPS/W
Not for LLMs
Not suitable (vision/CNN only) — built for the convolutional vision era (e.g. MobileNet v2 ~400 fps); never designed for language models, no memory to hold LLM weights.
Frequently mis-asked-about for LLMs; the answer is no. (Google's newer 'Coralboard' with a transformer-capable NPU is a separate, distinct product.)
~8 MB on-chip SRAM; TensorFlow Lite INT8 models onlyTensorFlow Lite (Edge TPU compiler) — vision models
Power: ~2 W
Groq LPU (GroqCard)
Groq · datacenter inference ASIC (Language Processing Unit)
Deterministic dataflow architecture
Not local
Not local (cloud/datacenter only) — accessed via GroqCloud token API; rack-scale only. Listed to clarify it is not home/local hardware.
~$20k/card and useless in isolation; belongs to the cloud-inference economics story, not local hardware.
230 MB on-chip SRAM @ ~80 TB/s, NO DRAM — a single chip cannot hold even a small model; ~576 LPUs needed to serve Llama-2-70BGroqCloud API (OpenAI-compatible endpoint) — service, not local device
Power: Datacenter card, rack-scale
Apple Silicon M-series (M3/M4 Max/Pro) — see ai-gpu-database
Apple · unified-memory SoC (POINTER ROW — not duplicated here)
Up to 40-core GPU + 16-core Apple Neural Engine (ANE)
Excellent
Excellent — but the LLM path is the GPU via Metal/MLX/llama.cpp, NOT the Apple Neural Engine (ANE is used for Core ML vision/system tasks, not general LLM decode). Full specs live in the sibling GPU dataset to avoid duplication.
CROSS-LINK ONLY — full rows (apple-m4-max-128gb, apple-m3-max-128gb, apple-m4-pro-64gb) already in /data/ai-gpu-database/. Present here purely to disambiguate ANE vs GPU and keep the two datasets explicitly connected.
Up to 128 GB unified memory — the reference standard for local LLM on a SoCMLX, llama.cpp (Metal), Ollama, LM Studio
Power: Laptop/desktop envelope

Sources: manufacturer / first-party docs (full source URL + last_verified per row in the CSV/JSON). The "can it run a local LLM?" companion to the AI-GPU hardware database (GPUs + Apple Silicon) and the local AI runtimes. Part of the Sovereign AI stack.