Skip to content

Bitcoin accepted at checkout  |  Ships from Laval, QC, Canada  |  Expert support since 2016

Local AI Runtime Comparison: Ollama vs llama.cpp vs vLLM and more

Quick answer

This compares 16 local AI inference runtimes -- the software you run open-weight LLMs with on your own hardware, no cloud API required. For each it lists the category (engine, desktop app, server or frontend), how you interact with it, which model formats it loads, whether it exposes an OpenAI-compatible API (14 do, so you can point existing apps at your own machine), GPU support, license and OS.

There is no single best runtime: pick by your interface (CLI vs desktop GUI vs server), your hardware (NVIDIA vs Apple Silicon vs CPU) and your license needs. Many of these wrap llama.cpp under the hood, so GGUF is the common model format. An OpenAI-compatible API is the key to sovereignty -- it lets you swap a cloud endpoint for your own without rewriting your apps. D-Central is one node in this ecosystem; verify the current details against each project repo. Free CSV/JSON under CC BY 4.0.

Download CSV Download JSON REST API →

RuntimeCategoryInterfaceFormatsOpenAI APIGPULicense
Ollama
simplest local model runner
engineCLI + serverGGUF (can import safetensors)YesNVIDIA (CUDA), AMD (ROCm), Apple (Metal)MIT
llama.cpp
underlying GGUF inference engine many tools wrap
engineCLI + server (also a C/C++ library)GGUFYesNVIDIA (CUDA), AMD (ROCm), Apple (Metal), Vulkan, CPUMIT
LM Studio
non-technical desktop users
appDesktop GUI (+ local server, CLI 'lms')GGUF, MLXYesNVIDIA (CUDA), AMD (ROCm/Vulkan), Apple (Metal)Closed (proprietary); free for personal use
vLLM
max-throughput GPU serving
serverServer (API) + Python libraryHF safetensors, GPTQ, AWQYesNVIDIA (CUDA) first, AMD (ROCm)Apache-2.0
text-generation-webui (oobabooga)
power users wanting many backends and quant formats
appWeb UI (+ API)GGUF, EXL2, GPTQ, AWQ, HF safetensorsYesNVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPUAGPL-3.0
Jan
open-source offline desktop ChatGPT alternative
appDesktop GUI (+ local server)GGUF (llama.cpp engine)YesNVIDIA (CUDA), AMD (Vulkan), Apple (Metal)Apache-2.0
GPT4All
privacy-focused desktop chat with local documents
appDesktop GUI (+ local API server)GGUFYesNVIDIA/AMD (Vulkan), Apple (Metal), CPUMIT
KoboldCpp
single-binary story writing and roleplay
appGUI launcher + Web UI + Server (API)GGUFYesNVIDIA (CUDA), AMD (Vulkan), Apple (Metal), CPUAGPL-3.0
LocalAI
self-hosted drop-in OpenAI API replacement
serverServer (API)GGUF (plus multimodal/whisper/diffusers backends)YesNVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPUMIT
Llamafile
single portable file that runs across OSes
engineCLI + server (single-file executable)GGUFYesNVIDIA (CUDA), AMD (ROCm), Apple (Metal), CPUApache-2.0
Open WebUI
self-hosted chat UI in front of Ollama or OpenAI-compatible APIs
frontendWeb UIN/A (uses a backend)NoN/A (backend-dependent)BSD-3-Clause (v0.6.6+ adds a branding clause + CLA)
MLX / mlx-lm
Apple Silicon native inference
libraryPython library + CLIMLX (converts from HF safetensors)YesApple (Metal) onlyMIT
Hugging Face TGI
production serving of Hugging Face models
serverServer (API)HF safetensors, GPTQ, AWQYesNVIDIA (CUDA) first, AMD (ROCm), Intel GaudiApache-2.0
ExLlamaV2 / TabbyAPI
fast EXL2 quantized inference on NVIDIA GPUs
enginePython library + Server (API via TabbyAPI)EXL2, GPTQYesNVIDIA (CUDA)MIT (ExLlamaV2); AGPL-3.0 (TabbyAPI)
SGLang
high-throughput structured/agentic GPU serving
serverServer (API) + Python libraryHF safetensors, GPTQ, AWQ, FP8YesNVIDIA (CUDA) first, AMD (ROCm)Apache-2.0
AnythingLLM
all-in-one local RAG desktop app
appDesktop GUI + Web UI (Docker)GGUF (built-in engine; also connects to external providers)NoNVIDIA (CUDA), Apple (Metal), CPUMIT

Source: each project's own repository and docs. Pairs with the local-LLM VRAM calculator, the self-hosting hub and the sovereign self-hosting catalog. Local-AI tooling moves fast — confirm details against the project repo.