Ollama

Sovereign AI

Ollama is an open-source platform for running large language models locally with a single command. Written mainly in Go and using llama.cpp as its inference engine, it packages model weights, configuration, and prompt templates into a simple workflow: ollama run downloads a model, loads it onto your GPU or CPU, and drops you into a chat session. Its whole purpose is to make self-hosted AI as approachable as installing any other developer tool — no Python environments, no CUDA wrangling, no manual model conversion.

Model management, done like containers

Ollama treats models much the way Docker treats images. You ollama pull a named model from its registry, it lands in a local cache, and you can list, swap, or remove it cleanly. Each model ships with a Modelfile — a small manifest that pins the weights, the chat template, and default sampling parameters — and you can write your own Modelfile to derive a custom variant: change the system prompt, adjust the context length, or bake in your preferred temperature. Under the hood the weights are stored in the GGUF format, the quantized single-file format that llama.cpp made the de facto standard for local models. Most models in the registry come pre-quantized at several sizes, so you pick the variant that fits your card's VRAM rather than converting anything yourself.

The local API layer

Beyond the interactive terminal, Ollama runs as a background server exposing a REST API on port 11434 — a POST to localhost:11434/api/chat or /api/generate returns completions from whatever model you name. It also exposes an OpenAI-compatible endpoint, which matters more than it sounds: a large share of AI tooling speaks that API, so pointing existing software at your own machine is often a one-line base-URL change. Official Python and JavaScript client libraries round it out. The practical result is that a home server can quietly serve inference to every application in the house, with no prompt ever leaving your network.

Where it fits in a sovereign stack

Ollama runs on macOS, Windows, Linux, and Docker, and it handles the unglamorous operational details — loading and unloading models on demand, keeping recently used models warm in memory, splitting layers between GPU and system RAM when a model doesn't quite fit. Because it builds on llama.cpp, it inherits that engine's broad hardware support: NVIDIA and AMD GPUs, Apple Silicon, and plain CPUs all work. For the sovereign-minded, that is the point. The same instinct that says run your own Bitcoin node rather than trusting someone else's says run your own models rather than renting a black box: your questions, your documents, and your embeddings stay on hardware you own. Ollama is not the fastest server for heavy multi-user loads — dedicated engines win there — but for one person or one household getting real work out of local models, it is the shortest path from bare metal to a working assistant.

Practical starting point

A workable recipe: install Ollama, pull a small instruction-tuned model at 4-bit quantization, and confirm it answers in the terminal. Then wire one application — a chat UI, an editor plugin, a RAG pipeline over your own notes — into the local API. From there, scaling up is just pulling bigger models as your VRAM budget allows.

Its limits are worth knowing too. Ollama optimizes for convenience over raw throughput: the defaults favor conservative context lengths and quantization levels chosen for broad hardware, so power users eventually peek beneath the hood — adjusting context size in a Modelfile, choosing a heavier quantization for quality, or graduating to a dedicated serving engine when several users hammer one box at once. None of that diminishes what it does best. Most people never run a local model because the first hour defeats them; Ollama compresses that first hour into five minutes, and five minutes is a price anyone will pay for an assistant that works during an internet outage, costs nothing per query, and never phones home. For the plebs building a sovereign homestead stack, it is the standard on-ramp.

Find local-AI runtimes in the sovereign self-hosting catalog.

Ollama is an open-source platform for running large language models locally with a single command. Written mainly in Go and using llama.cpp as its inference…

Explore the Full Glossary

Browse all Bitcoin mining terms from A to Z. Whether you are a beginner or expert, deepen your understanding of the mining ecosystem.

Mining Glossary

ASIC Miner Database

Compare 500+ miners with real-time profitability data, home mining scores, and detailed specs.

Compare Miners