llama.cpp

Sovereign AI

llama.cpp is an open-source inference engine, written primarily in C and C++, that runs large language models (LLMs) entirely on your own hardware. Founded by Georgi Gerganov, its stated goal is to enable LLM inference "with minimal setup and state-of-the-art performance on a wide range of hardware, locally and in the cloud." For sovereign Bitcoiners, it is the closest analogue in AI to running your own node: no API keys, no cloud account, and no data leaving the machine.

How it works

llama.cpp loads models in the GGUF format and supports aggressive integer quantization, from 8-bit down to roughly 1.5-bit, which shrinks a model enough to fit in the RAM or VRAM of ordinary computers. It runs across an unusually broad hardware base: Apple Silicon via Metal, x86 CPUs with AVX/AVX2/AVX-512, NVIDIA GPUs (CUDA), AMD GPUs (HIP/ROCm), Intel GPUs (SYCL), and even RISC-V. A machine you might already use for a mining dashboard or self-hosted services can often run a capable model.

The trick that makes modest hardware useful: layer offloading

llama.cpp's signature flexibility is that a model does not have to fit entirely on the GPU. Its layer-offload mechanism lets you place as many transformer layers as fit into VRAM and run the remainder on the CPU from system RAM — one flag, any split. A machine with an 8 GB GPU and 32 GB of RAM can therefore run models far larger than the card alone could hold, at speeds that degrade gracefully rather than failing outright. The performance rule of thumb follows from how inference works: generation speed is governed by memory bandwidth, so layers on the GPU are fast, layers in system RAM are slower, and the overall tokens-per-second lands in between, roughly in proportion to the split. Full-GPU is best, full-CPU still works — which is exactly the property that makes local AI accessible instead of gated behind flagship hardware.

From library to daily driver

The project ships practical tooling on top of the core engine. llama-cli covers interactive and scripted use; llama-server exposes an OpenAI-compatible HTTP API, meaning tools written for hosted AI services can be pointed at a fully local endpoint by changing one URL. Quantization utilities convert and requantize models, and the same GGUF file runs unmodified on a Mac, a Linux server, or a Windows gaming PC. Much of the wider local-AI ecosystem — several desktop front-ends and runtime wrappers among them — is built directly on llama.cpp as the underlying engine, so skills learned here transfer across most of the local stack. Development moves fast — support for new model architectures typically lands within days of release — so keeping a build current is part of the workflow, and the project's permissive MIT license means nothing stops you from patching, forking, or embedding it in your own tools.

Why it matters for sovereignty

llama.cpp collapses the distance between "AI user" and "AI operator." There is no account, no telemetry requirement, no usage meter, and no counterparty who can revoke access or change terms; the model file and the binary are yours, and they keep working offline indefinitely — the same model that answers you today will answer identically in ten years, on whatever hardware you still own. That is self-custody logic applied to computation, and it pairs naturally with the rest of a self-hosted stack — the same box that runs your node and dashboards can serve a private assistant grounded in your own documents.

llama.cpp is the foundation most local runtimes share. To understand the model files it consumes, see GGUF; for friendlier wrappers around the same engine, see Ollama and LM Studio.

Find local-AI runtimes in the sovereign self-hosting catalog.

llama.cpp is an open-source inference engine, written primarily in C and C++, that runs large language models (LLMs) entirely on your own hardware. Founded by…

Explore the Full Glossary

Browse all Bitcoin mining terms from A to Z. Whether you are a beginner or expert, deepen your understanding of the mining ecosystem.

Mining Glossary

ASIC Miner Database

Compare 500+ miners with real-time profitability data, home mining scores, and detailed specs.

Compare Miners