Local AI Runtime Comparison: Ollama vs llama.cpp vs vLLM and more
Quick answer
This compares 16 local AI inference runtimes -- the software you run open-weight LLMs with on your own hardware, no cloud API required. For each it lists the category (engine, desktop app, server or frontend), how you interact with it, which model formats it loads, whether it exposes an OpenAI-compatible API (14 do, so you can point existing apps at your own machine), GPU support, license and OS.
There is no single best runtime: pick by your interface (CLI vs desktop GUI vs server), your hardware (NVIDIA vs Apple Silicon vs CPU) and your license needs. Many of these wrap llama.cpp under the hood, so GGUF is the common model format. An OpenAI-compatible API is the key to sovereignty -- it lets you swap a cloud endpoint for your own without rewriting your apps. D-Central is one node in this ecosystem; verify the current details against each project repo. Free CSV/JSON under CC BY 4.0.
Download CSV Download JSON REST API →
| Runtime | Category | Interface | Formats | OpenAI API | GPU | License |
|---|---|---|---|---|---|---|
| Ollama simplest local model runner | engine | CLI + server | GGUF (can import safetensors) | Yes | NVIDIA (CUDA), AMD (ROCm), Apple (Metal) | MIT |
| llama.cpp underlying GGUF inference engine many tools wrap | engine | CLI + server (also a C/C++ library) | GGUF | Yes | NVIDIA (CUDA), AMD (ROCm), Apple (Metal), Vulkan, CPU | MIT |
| LM Studio non-technical desktop users | app | Desktop GUI (+ local server, CLI 'lms') | GGUF, MLX | Yes | NVIDIA (CUDA), AMD (ROCm/Vulkan), Apple (Metal) | Closed (proprietary); free for personal use |
| vLLM max-throughput GPU serving | server | Server (API) + Python library | HF safetensors, GPTQ, AWQ | Yes | NVIDIA (CUDA) first, AMD (ROCm) | Apache-2.0 |
| text-generation-webui (oobabooga) power users wanting many backends and quant formats | app | Web UI (+ API) | GGUF, EXL2, GPTQ, AWQ, HF safetensors | Yes | NVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPU | AGPL-3.0 |
| Jan open-source offline desktop ChatGPT alternative | app | Desktop GUI (+ local server) | GGUF (llama.cpp engine) | Yes | NVIDIA (CUDA), AMD (Vulkan), Apple (Metal) | Apache-2.0 |
| GPT4All privacy-focused desktop chat with local documents | app | Desktop GUI (+ local API server) | GGUF | Yes | NVIDIA/AMD (Vulkan), Apple (Metal), CPU | MIT |
| KoboldCpp single-binary story writing and roleplay | app | GUI launcher + Web UI + Server (API) | GGUF | Yes | NVIDIA (CUDA), AMD (Vulkan), Apple (Metal), CPU | AGPL-3.0 |
| LocalAI self-hosted drop-in OpenAI API replacement | server | Server (API) | GGUF (plus multimodal/whisper/diffusers backends) | Yes | NVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPU | MIT |
| Llamafile single portable file that runs across OSes | engine | CLI + server (single-file executable) | GGUF | Yes | NVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPU | Apache-2.0 |
| Open WebUI self-hosted chat UI in front of Ollama or OpenAI-compatible APIs | frontend | Web UI | N/A (uses a backend) | No | N/A (backend-dependent) | BSD-3-Clause (v0.6.6+ adds a branding clause + CLA) |
| MLX / mlx-lm Apple Silicon native inference | library | Python library + CLI | MLX (converts from HF safetensors) | Yes | Apple (Metal) only | MIT |
| Hugging Face TGI production serving of Hugging Face models | server | Server (API) | HF safetensors, GPTQ, AWQ | Yes | NVIDIA (CUDA) first, AMD (ROCm), Intel Gaudi | Apache-2.0 |
| ExLlamaV2 / TabbyAPI fast EXL2 quantized inference on NVIDIA GPUs | engine | Python library + Server (API via TabbyAPI) | EXL2, GPTQ | Yes | NVIDIA (CUDA) | MIT (ExLlamaV2); AGPL-3.0 (TabbyAPI) |
| SGLang high-throughput structured/agentic GPU serving | server | Server (API) + Python library | HF safetensors, GPTQ, AWQ, FP8 | Yes | NVIDIA (CUDA) first, AMD (ROCm) | Apache-2.0 |
| AnythingLLM all-in-one local RAG desktop app | app | Desktop GUI + Web UI (Docker) | GGUF (built-in engine; also connects to external providers) | No | NVIDIA (CUDA), Apple (Metal), CPU | MIT |
Source: each project's own repository and docs. Pairs with the local-LLM VRAM calculator, the self-hosting hub and the sovereign self-hosting catalog. Local-AI tooling moves fast — confirm details against the project repo.
Related products, repair, and setup paths
- self-hosted AI for Bitcoiners hub
- plebs guide to self-hosted AI
- install Ollama in 10 minutes
- LM Studio vs Ollama vs llama.cpp
- connect local AI to Home Assistant and Obsidian
- self-hosted AI troubleshooting
- repurpose mining hardware into an AI hashcenter
- local AI model leaderboards
Last reviewed June 20, 2026.
