Definition
llama.cpp is an open-source inference engine, written primarily in C and C++, that runs large language models (LLMs) entirely on your own hardware. Founded by Georgi Gerganov, its stated goal is to enable LLM inference "with minimal setup and state-of-the-art performance on a wide range of hardware, locally and in the cloud." For sovereign Bitcoiners, it is the closest analogue in AI to running your own node: no API keys, no cloud account, and no data leaving the machine.
How it works
llama.cpp loads models in the GGUF format and supports aggressive integer quantization, from 8-bit down to roughly 1.5-bit, which shrinks a model enough to fit in the RAM or VRAM of ordinary computers. It runs across an unusually broad hardware base: Apple Silicon via Metal, x86 CPUs with AVX/AVX2/AVX-512, NVIDIA GPUs (CUDA), AMD GPUs (HIP/ROCm), Intel GPUs (SYCL), and even RISC-V. A machine you might already use for a mining dashboard or self-hosted services can often run a capable model.
Why it matters for sovereignty
The project ships llama-cli for interactive use and llama-server, which exposes an OpenAI-compatible HTTP API. That means tools written for hosted AI services can be pointed at a fully local endpoint instead. Much of the wider local-AI ecosystem, including several desktop front-ends, is built directly on top of llama.cpp as its underlying engine.
llama.cpp is the foundation most local runtimes share. To understand the model files it consumes, see GGUF; for friendlier wrappers around the same engine, see Ollama and LM Studio.
Find local-AI runtimes in the sovereign self-hosting catalog.
In Simple Terms
llama.cpp is an open-source inference engine, written primarily in C and C++, that runs large language models (LLMs) entirely on your own hardware. Founded by…
