Local AI Runtime Comparison: Ollama vs llama.cpp vs vLLM and more

This compares the runtimes that run a local LLM — the engines behind the model. Once a runtime can serve a model, the next question is what that model can do: for the tool servers that give a local agent real capabilities (files, git, browser, hardware design), see the self-hosted MCP server directory.

Quick answer

This compares 16 local AI inference runtimes -- the software you run open-weight LLMs with on your own hardware, no cloud API required. For each it lists the category (engine, desktop app, server or frontend), how you interact with it, which model formats it loads, whether it exposes an OpenAI-compatible API (14 do, so you can point existing apps at your own machine), GPU support, license and OS.

There is no single best runtime: pick by your interface (CLI vs desktop GUI vs server), your hardware (NVIDIA vs Apple Silicon vs CPU) and your license needs. Many of these wrap llama.cpp under the hood, so GGUF is the common model format. An OpenAI-compatible API is the key to sovereignty -- it lets you swap a cloud endpoint for your own without rewriting your apps. D-Central is one node in this ecosystem; verify the current details against each project repo. Free CSV/JSON under CC BY 4.0.

Download CSV Download JSON REST API →

Runtime	Category	Interface	Formats	OpenAI API	GPU	License
Ollama simplest local model runner	engine	CLI + server	GGUF (can import safetensors)	Yes	NVIDIA (CUDA), AMD (ROCm), Apple (Metal)	MIT
llama.cpp underlying GGUF inference engine many tools wrap	engine	CLI + server (also a C/C++ library)	GGUF	Yes	NVIDIA (CUDA), AMD (ROCm), Apple (Metal), Vulkan, CPU	MIT
LM Studio non-technical desktop users	app	Desktop GUI (+ local server, CLI 'lms')	GGUF, MLX	Yes	NVIDIA (CUDA), AMD (ROCm/Vulkan), Apple (Metal)	Closed (proprietary); free for personal use
vLLM max-throughput GPU serving	server	Server (API) + Python library	HF safetensors, GPTQ, AWQ	Yes	NVIDIA (CUDA) first, AMD (ROCm)	Apache-2.0
text-generation-webui (oobabooga) power users wanting many backends and quant formats	app	Web UI (+ API)	GGUF, EXL2, GPTQ, AWQ, HF safetensors	Yes	NVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPU	AGPL-3.0
Jan open-source offline desktop ChatGPT alternative	app	Desktop GUI (+ local server)	GGUF (llama.cpp engine)	Yes	NVIDIA (CUDA), AMD (Vulkan), Apple (Metal)	Apache-2.0
GPT4All privacy-focused desktop chat with local documents	app	Desktop GUI (+ local API server)	GGUF	Yes	NVIDIA/AMD (Vulkan), Apple (Metal), CPU	MIT
KoboldCpp single-binary story writing and roleplay	app	GUI launcher + Web UI + Server (API)	GGUF	Yes	NVIDIA (CUDA), AMD (Vulkan), Apple (Metal), CPU	AGPL-3.0
LocalAI self-hosted drop-in OpenAI API replacement	server	Server (API)	GGUF (plus multimodal/whisper/diffusers backends)	Yes	NVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPU	MIT
Llamafile single portable file that runs across OSes	engine	CLI + server (single-file executable)	GGUF	Yes	NVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPU	Apache-2.0
Open WebUI self-hosted chat UI in front of Ollama or OpenAI-compatible APIs	frontend	Web UI	N/A (uses a backend)	No	N/A (backend-dependent)	BSD-3-Clause (v0.6.6+ adds a branding clause + CLA)
MLX / mlx-lm Apple Silicon native inference	library	Python library + CLI	MLX (converts from HF safetensors)	Yes	Apple (Metal) only	MIT
Hugging Face TGI production serving of Hugging Face models	server	Server (API)	HF safetensors, GPTQ, AWQ	Yes	NVIDIA (CUDA) first, AMD (ROCm), Intel Gaudi	Apache-2.0
ExLlamaV2 / TabbyAPI fast EXL2 quantized inference on NVIDIA GPUs	engine	Python library + Server (API via TabbyAPI)	EXL2, GPTQ	Yes	NVIDIA (CUDA)	MIT (ExLlamaV2); AGPL-3.0 (TabbyAPI)
SGLang high-throughput structured/agentic GPU serving	server	Server (API) + Python library	HF safetensors, GPTQ, AWQ, FP8	Yes	NVIDIA (CUDA) first, AMD (ROCm)	Apache-2.0
AnythingLLM all-in-one local RAG desktop app	app	Desktop GUI + Web UI (Docker)	GGUF (built-in engine; also connects to external providers)	No	NVIDIA (CUDA), Apple (Metal), CPU	MIT

Source: each project's own repository and docs. Pairs with the local-LLM VRAM calculator, the self-hosting hub and the sovereign self-hosting catalog. For the RAG retrieval layer, see the local embedding models comparison. Local-AI tooling moves fast — confirm details against the project repo.

Related products, repair, and setup paths

Last reviewed July 18, 2026.