Skip to content

Bitcoin accepted at checkout  |  Ships from Laval, QC, Canada  |  Expert support since 2016

Ollama vs vLLM vs llama.cpp: Local Inference Server Comparison (2026)



Quick answer: For a single developer getting started in under five minutes, Ollama wins on simplicity. For a team serving ten or more concurrent users on NVIDIA or AMD hardware, vLLM wins on throughput — often by a factor of ten or more in head-to-head tests. For CPU-only machines, embedded systems, or maximum low-level control, llama.cpp is the direct choice. All three are open-source, free, and complement each other — the right answer depends on your hardware and concurrency target, not a horse race.

Choosing a local inference runtime is one of the first decisions you make when you stop renting AI from someone else’s cloud. Three projects dominate the conversation in 2026: Ollama, vLLM, and llama.cpp. Each one was built by a different team for a different problem. Understanding that is more useful than any benchmark table.

This guide covers architecture, throughput, hardware requirements, and a concrete decision tree — including how each runtime maps to the hardware tiers we configure at D-Central. If you are still deciding whether local AI is worth it at all, start with replacing cloud AI with a local LLM. If you need help sizing your hardware first, use the local LLM VRAM calculator before reading the throughput numbers below.

The three projects — who built them and why

llama.cpp — the foundation layer

llama.cpp is the work of Georgi Gerganov, who released it in March 2023 within days of Meta’s LLaMA weights leaking. The goal was blunt: run a 13-billion-parameter model on a MacBook. Gerganov succeeded, and in doing so he created the GGUF quantization format that the entire open-source local-AI ecosystem now uses. Every tool on this page — including Ollama — either uses llama.cpp directly under the hood or was built partly in response to it.

llama.cpp handles CPU inference natively, with AVX2/AVX-512 SIMD paths for x86 and NEON paths for ARM. GPU acceleration plugs in via -DGGML_CUDA=ON for NVIDIA, -DGGML_METAL=ON for Apple Silicon, -DGGML_HIP=ON for AMD ROCm, and a Vulkan backend for cross-platform GPU offload. The GGUF file format stores quantized weights and model metadata together, and the quantize tool produces variants from Q2_K (smallest, lowest quality) through Q8_0 (near-lossless) and F16 (full precision). As of mid-2026, experimental ternary quantization (BitNet-style 1.58-bit weights) is under active development and could eventually allow 70B models to run in approximately 14 GB of RAM — but this is not yet production-stable.

The strength of llama.cpp is control and reach: it runs on hardware that no other runtime touches — CPU-only servers, Raspberry Pi 5, single-board ARM machines, old workstations with no CUDA-capable GPU. The tradeoff is that you configure everything yourself: batch sizes, KV cache limits, GPU layer counts, context windows. There is a server mode (llama-server) with an OpenAI-compatible endpoint, but it is a single-binary utility, not a model management platform.

Ollama — llama.cpp with a management layer

Ollama was released in 2023 by the Ollama team and reached version 0.24.x by May 2026. Its core insight was that most developers do not want to compile backends, locate GGUF files, or write shell scripts to swap models. Ollama packages llama.cpp inference behind a clean CLI (ollama run llama3), a Docker-compatible model registry, an OpenAI-compatible REST API on port 11434, and a declarative Modelfile format for custom configurations.

On Apple Silicon, Ollama 0.19+ routes through MLX rather than llama.cpp’s Metal path, which measurably improves throughput on M-series chips. On Linux/Windows with NVIDIA hardware, it uses CUDA through the llama.cpp backend. AMD ROCm is supported on Linux. Hardware detection is automatic — Ollama selects Metal, CUDA, ROCm, or CPU paths without manual flags.

Ollama’s 0.24.0 release added native integration with agentic coding tools including Claude Code, OpenAI Codex, and Copilot CLI via the ollama launch command, making it practical for local-first development workflows. Model management (ollama list, ollama pull, ollama rm) is the cleanest in the category. The limitation is concurrency: Ollama processes requests serially by default, and latency under more than five or six simultaneous users degrades quickly.

vLLM — production throughput engine

vLLM was published in 2023 by the UC Berkeley Sky Computing Lab, led by Woosuk Kwon, Zhuohan Li, and colleagues. The paper introduced PagedAttention, a memory management algorithm that treats the KV (key-value) cache the way an operating system treats virtual memory — splitting it into non-contiguous pages. The result is up to 96% reduction in KV cache memory waste versus naïve implementations, which translates directly into the ability to run more concurrent requests on the same GPU.

vLLM layers on continuous batching (requests are added to the batch mid-flight rather than waiting for the current batch to finish), optimized CUDA/HIP kernels, tensor parallelism across multiple GPUs, and a drop-in OpenAI-compatible API server. It supports NVIDIA, AMD, Intel Gaudi accelerators, AWS Trainium/Inferentia, and IBM Power CPUs. Model formats include SafeTensors (native), GPTQ, AWQ, and FP8 quantization — GGUF is not natively supported, which means models must come from Hugging Face rather than the Ollama registry.

vLLM’s weakness is the setup floor: it requires Python, a compatible GPU with at least 8 GB of VRAM for a 7B model at INT4, and some familiarity with command-line server management. There is no GUI and no automatic model management comparable to Ollama’s. But in exchange, the throughput headroom at concurrent load is in a different tier from either alternative.

Feature comparison

Feature llama.cpp Ollama vLLM
Setup complexity Medium (compile flags, manual config) Low (one-line install) Medium-high (Python env, GPU driver)
CPU-only inference Yes — first-class Yes (via llama.cpp backend) Limited (some CPU backends; not primary)
NVIDIA CUDA Yes Yes Yes — primary target
AMD ROCm Yes (Linux) Yes (Linux) Yes
Apple Silicon (Metal / MLX) Yes (Metal) Yes (MLX path on 0.19+) No
Model format GGUF GGUF (via Ollama registry) SafeTensors, GPTQ, AWQ, FP8
OpenAI-compatible API Yes (llama-server) Yes (port 11434) Yes (port 8000)
Multi-GPU tensor parallelism Experimental No Yes — production-grade
Concurrent request handling Manual batching Serial (queue-based) Continuous batching (PagedAttention)
Model management CLI Manual file management Yes (ollama pull/list/rm) No (download manually)
Windows support Yes Yes WSL2 recommended
Primary audience Developers, embedded, research Developers, solo users, agentic tools Production teams, multi-user serving

Throughput and concurrency

Throughput numbers vary significantly with hardware, model size, quantization level, batch size, and concurrency. The figures below are drawn from third-party benchmarks published in 2025–2026; treat them as directional, not absolute, and re-run on your own hardware before making deployment decisions.

Single-user latency

At a single-user workload, Ollama and llama.cpp perform similarly — Ollama adds roughly 10–30% management overhead in raw throughput tests versus direct llama.cpp calls, a gap that matters much less than the practical convenience gain. vLLM’s advantage at this tier is smaller because its batching machinery is underutilized with one active request.

Concurrent-user throughput

This is where the tools diverge. Benchmarks published in 2025–2026 consistently show vLLM pulling ahead sharply once concurrent users exceed five or six. One representative test (hardware: NVIDIA A100 80GB, model: Llama 3 8B FP16, 8+ concurrent users) recorded vLLM at approximately 187 tok/s aggregate versus Ollama’s degraded output under the same load. A separate test on NVIDIA A10G with INT4 (AWQ) quantization recorded vLLM at 385 tok/s at batch size 8. Under a concurrent-load stress test where requests were fired simultaneously, one benchmark recorded vLLM at roughly 793 aggregate tok/s versus Ollama at approximately 41 tok/s. Specific numbers vary — what is consistent across sources is that vLLM’s concurrent throughput advantage over Ollama is measured in multiples, not percentages, once load exceeds a single-digit number of parallel users.

For broader context: vLLM’s authors benchmarked 14–24× higher throughput than HuggingFace Transformers on the same hardware, attributing the gain to continuous batching, PagedAttention, and optimized CUDA kernels. These gains apply to multi-user serving scenarios.

CPU-only inference

Neither Ollama nor vLLM is a practical choice when no discrete GPU is available. llama.cpp direct is the only option that targets CPU inference as a first-class workload, using AVX2/AVX-512 (x86) or NEON (ARM) acceleration. Expect roughly 10–30 tok/s on a modern x86 desktop CPU for a 7B Q4 model — usable for personal, low-latency-tolerant applications, not for serving multiple users.

Hardware requirements and VRAM sizing

As a quick rule of thumb, budget approximately 0.6 GB of VRAM per billion parameters at Q4_K_M quantization. A 7B model needs 4–6 GB; a 13B model needs 8–10 GB; a 70B model at INT4 needs approximately 35–40 GB (multi-GPU or CPU offload required). For full-precision FP16, roughly double those numbers.

Use the local LLM VRAM calculator to get a sizing estimate for your specific model and quantization target. For hardware selection guidance — GPU tiers, CPU recommendations, RAM requirements — see the local AI hardware guide.

VRAM tier What fits (Q4_K_M) Best runtime
0 GB (CPU only) Up to 7B (slow) llama.cpp direct
6–8 GB Up to 7B comfortably Ollama
12–16 GB Up to 13B; 7B with long context Ollama (single user); vLLM (multi-user)
24 GB Up to 34B; 70B with CPU offload vLLM (multi-user) or Ollama (dev)
40–80 GB (single GPU) 70B FP16; large context vLLM
Multi-GPU (2×80 GB +) 405B+; tensor parallel vLLM with tensor parallelism

Decision tree — which runtime for your situation

Work through this in order:

  1. No GPU available (CPU only)?llama.cpp direct. Ollama works on CPU too but adds overhead without adding value in a headless server context.
  2. Apple Silicon (M1/M2/M3/M4)?Ollama. Ollama 0.19+ uses the MLX path on Apple Silicon, which is currently the fastest available inference path on those chips. MLX standalone is an option for advanced users who want even lower overhead.
  3. One developer, prototyping, local laptop or workstation with NVIDIA/AMD GPU?Ollama. The install is one command, the model registry handles downloads, and the OpenAI-compatible API drops into any LLM client or agent framework.
  4. Production serving — five or more concurrent users, NVIDIA GPU, Linux?vLLM. The concurrency gap is too large to ignore at this scale, and vLLM’s OpenAI-compatible endpoint makes integration straightforward.
  5. Need maximum control — custom quantization, speculative decoding tuning, multi-GPU without Python overhead, or embedded target?llama.cpp direct. Every parameter is configurable; no management layer stands between you and the inference engine.
  6. Agentic coding tools (Claude Code, Codex, Copilot CLI) local routing?Ollama with the ollama launch command, which natively integrates with those tools as of version 0.24.0.

D-Central hardware tier mapping

D-Central configures local AI infrastructure for SMBs, Bitcoiners, and sovereignty-focused teams. Here is how each runtime maps to the hardware tiers we work with — not a product price list, but a practical pairing guide. Pricing and availability for configured systems are quote-only; contact the team for your situation.

Tier Example hardware Recommended runtime Practical capacity
Pleb / personal Apple MacBook M3 Pro, gaming PC with RTX 3070/4070 Ollama 7B–13B solo user; coding assistant, local RAG
Small team / SMB workstation Workstation with RTX 3090/4090 (24 GB VRAM) Ollama (dev) or vLLM (shared) 13B–34B; 3–5 concurrent users with vLLM
Pleb AI Box Mini-PC, Ryzen 9 + 64 GB RAM, iGPU or RX 7600 Ollama or llama.cpp 7B–13B; personal or small family/office use
Hashcenter AI Node Dual A6000 or A100 80GB (multi-GPU) vLLM (tensor parallel) 70B production serving; 10–50 concurrent users
DGX Spark / workstation-class NVIDIA DGX Spark (128 GB unified memory) vLLM 405B+ models; enterprise-scale on-prem AI

All AI hardware configurations D-Central ships run only open-weight models on hardware you physically control. No usage data leaves your premises. That is the sovereignty point: the inference runtime is almost a secondary question compared to the question of whether the compute stays in your building. See sovereign AI in Canada for the broader context.

Mixing the stack — a common pattern

Many teams end up running more than one runtime for different purposes. A typical setup: Ollama on the developer’s workstation for fast local iteration and Claude Code / Codex integration; vLLM on a shared GPU server for production team-facing endpoints; llama.cpp direct for an embedded low-power node that needs to run without a GPU. The OpenAI-compatible API that all three expose means switching at the application layer is a one-line config change.

There is no reason to treat these projects as competitors. Ollama builds on the foundation that Georgi Gerganov created with llama.cpp. vLLM’s PagedAttention was an original systems research contribution from Berkeley that the broader community — including llama.cpp — has learned from. Each project pushes the others forward.

Frequently asked questions

Is Ollama just a wrapper around llama.cpp?

Yes, with qualifications. Ollama uses llama.cpp as its inference backend on most platforms, and adds a model registry, CLI, REST API layer, and Modelfile configuration system. On Apple Silicon (M1/M2/M3/M4), Ollama 0.19+ switched to the MLX backend for improved performance. So “wrapper” is accurate at the inference-engine level, but the management tooling Ollama provides is substantial and genuinely reduces setup friction.

Can vLLM run on CPU or Apple Silicon?

vLLM is primarily designed for NVIDIA and AMD GPUs on Linux. There are experimental CPU backends and limited Intel Gaudi / AWS Trainium support, but vLLM on Apple Silicon is not a supported target as of mid-2026. If you are on Apple Silicon, Ollama (with its MLX path) or llama.cpp with Metal acceleration are the appropriate choices.

What is PagedAttention and why does it matter?

PagedAttention is a memory management algorithm developed at UC Berkeley for vLLM. It splits the KV (key-value) cache — the memory structure that stores context as tokens are generated — into non-contiguous pages, similar to how an OS manages virtual memory. The result is up to 96% reduction in KV cache memory waste compared to naïve implementations, allowing more concurrent requests to share the same GPU memory. This is the primary reason vLLM’s concurrent throughput scales so much better than alternatives.

Do I need a GPU to run any of these locally?

No. All three support CPU-only inference. llama.cpp is the most optimized for this use case. Expect roughly 10–30 tokens per second for a 7B model at Q4_K_M quantization on a modern x86 desktop CPU — slow enough to feel on longer outputs, fast enough for personal use where you are not waiting at a terminal. For team or production use without GPU hardware, the experience degrades significantly; a dedicated GPU is the practical prerequisite for anything beyond personal experimentation.

Which runtime should I use for local RAG (retrieval-augmented generation)?

For solo RAG workflows — document search, personal knowledge bases, local Q&A — Ollama is the easiest starting point. Its OpenAI-compatible API means frameworks like LangChain, LlamaIndex, and Open WebUI drop in without modification. For multi-user RAG serving (a team sharing a document search backend), vLLM handles the concurrent embedding and generation requests better. For embedding generation specifically, llama.cpp’s server mode and dedicated embedding endpoints are a lightweight option that does not require the full Ollama stack.

What is GGUF and does vLLM support it?

GGUF (GPT-Generated Unified Format) is the model file format developed by Georgi Gerganov for llama.cpp. It stores quantized weights and model metadata in a single file optimized for efficient loading. Ollama uses GGUF models via its registry. vLLM does not natively support GGUF — it expects models in SafeTensors, GPTQ, AWQ, or FP8 formats, typically downloaded from Hugging Face. If you want to use a GGUF model with vLLM, you need to convert it first, which adds friction.

How do I run local AI for an entire office team in Canada?

A practical small-team setup uses a single GPU server (24–80 GB VRAM depending on model size) running vLLM, with an Nginx reverse proxy in front of the OpenAI-compatible API. Users point their existing AI tools at the local endpoint. No data leaves the building. D-Central configures these systems for Canadian SMBs — see the AI sovereignty consulting page for service tiers, or the local AI hardware guide for hardware sizing before you reach out.

Is DCENT_OS compatible with these inference runtimes?

DCENT_OS is currently in closed beta under GPL-3.0, with a public beta targeted for summer 2026. Specific integration details will be published when the beta opens. Generally, any system running DCENT_OS that includes GPU hardware can run Ollama or vLLM as userspace services; the OS layer does not restrict inference runtime choice.

What is the quickest way to get started with local AI today?

Install Ollama (ollama.com), run ollama pull llama3, then ollama run llama3. You will be talking to a local 8B model in under ten minutes on any modern laptop with 8 GB or more of RAM. From there, the local AI hardware guide and replacing cloud AI with a local LLM cover the next steps — better hardware, production configuration, and connecting your tools to the local endpoint.