Inference

Inference is the stage at which a trained machine-learning model is actually used: it receives an input (such as a text prompt) and produces an output (a prediction or generated response). It is the "execution" phase, as opposed to training, the "learning" phase in which the model's weights are fit to massive datasets. For sovereign, self-hosted AI, inference is the part that runs on your own hardware.

Forward Pass vs. Backward Pass

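During inference, data flows through the network's layers in a single direction (the forward pass) to compute an output. Training additionally runs a backward pass: the output is compared with a known answer, the resulting error is propagated back through the layers to compute gradients, and the weights are adjusted accordingly. Because inference skips the backward pass and weight updates, it does not need to store gradients, optimizer state, or the intermediate activations that backpropagation requires, so it needs far less memory and compute than training. This is why a model that took a data center weeks to train can, if it is small enough or quantized, run on a laptop or even a phone.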
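As a rough illustration of that gap, the sketch below compares the memory needed just to hold a model for inference with the memory a common training setup keeps per parameter. The 16 bytes per parameter figure is an assumption (mixed-precision training with the Adam optimizer), it ignores activations entirely, and the function names are illustrative rather than taken from any library.

```python
def inference_bytes(n_params: int, bytes_per_weight: int = 2) -> int:
    """Memory to hold the weights alone, assuming 16-bit (2-byte) weights."""
    return n_params * bytes_per_weight


def training_state_bytes(n_params: int) -> int:
    """Per-parameter training state under an assumed mixed-precision Adam setup.

    2 bytes fp16 weights + 2 bytes fp16 gradients + 4 bytes fp32 master weights
    + 4 bytes Adam first moment + 4 bytes Adam second moment = 16 bytes.
    Activations saved for the backward pass come on top of this.
    """
    return n_params * (2 + 2 + 4 + 4 + 4)


if __name__ == "__main__":
    n = 7_000_000_000  # a hypothetical 7-billion-parameter model
    print(f"inference weights: {inference_bytes(n) / 1e9:.0f} GB")
    print(f"training state:    {training_state_bytes(n) / 1e9:.0f} GB")
```

Under these assumptions the training state is eight times the size of the inference weights before any activations are counted, and quantizing the weights for inference widens the gap further.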

Autoregressive Generation

For large language models, inference is autoregressive: each token is produced by one forward pass, appended to the running context, and fed back in to predict the next token. Throughput is commonly measured in tokens per second.
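The loop described above can be sketched in a few lines. This is a minimal illustration, not a real runtime: `toy_forward` is a hypothetical stand-in for a model's forward pass, and the loop always takes the highest-scoring token (greedy decoding) to keep the example deterministic.

```python
from typing import Callable, List, Optional


def toy_forward(context: List[int], vocab_size: int = 5) -> List[float]:
    """Hypothetical 'model': scores the token after the last one highest."""
    favoured = (context[-1] + 1) % vocab_size
    return [1.0 if token == favoured else 0.0 for token in range(vocab_size)]


def generate(
    forward: Callable[[List[int]], List[float]],
    prompt: List[int],
    max_new_tokens: int,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Autoregressive loop: one forward pass per new token."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        scores = forward(context)  # one forward pass over the running context
        next_token = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        context.append(next_token)  # feed the new token back in
        if next_token == eos_id:  # stop early at an end-of-sequence token
            break
    return context


if __name__ == "__main__":
    print(generate(toy_forward, prompt=[0], max_new_tokens=4))
```

Real runtimes avoid recomputing the whole context on every step by caching per-token attention state, but the control flow is the same: predict, append, repeat.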

Prefill versus decode: two different workloads

LLM inference has two phases with opposite performance profiles. Prefill processes your entire prompt at once, potentially thousands of tokens in parallel, and is compute-bound: it stresses the GPU's raw arithmetic throughput, and you experience it as the pause before the first word appears. Decode then generates one token at a time and is memory-bandwidth-bound: for every single token, the hardware must stream essentially all of a dense model's weights out of memory, so generation speed tracks memory bandwidth far more than core count. This is why a machine with modest compute but fast unified memory can out-generate a nominally stronger one, and why time-to-first-token and tokens-per-second are quoted as separate numbers: they measure different bottlenecks.
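A back-of-the-envelope sketch of the two bottlenecks follows. It rests on two common approximations that are assumptions here, not measurements: decode speed is capped at memory bandwidth divided by the bytes of weights read per token, and prefill costs roughly 2 floating-point operations per parameter per prompt token. The hardware figures in the example are hypothetical.

```python
def decode_tokens_per_second_ceiling(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed: every token streams all weights once."""
    return bandwidth_bytes_per_s / model_bytes


def prefill_seconds_estimate(n_params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Rough time to first token, assuming ~2 FLOPs per parameter per token."""
    return 2 * n_params * prompt_tokens / flops_per_s


if __name__ == "__main__":
    # Hypothetical machine: 400 GB/s memory bandwidth, 10 TFLOP/s usable compute.
    # Hypothetical model: 7 billion parameters stored in about 4 GB.
    tps = decode_tokens_per_second_ceiling(model_bytes=4e9, bandwidth_bytes_per_s=400e9)
    ttft = prefill_seconds_estimate(n_params=7e9, prompt_tokens=1000, flops_per_s=10e12)
    print(f"decode ceiling: about {tps:.0f} tokens/s")
    print(f"prefill for 1000 tokens: about {ttft:.1f} s")
```

Doubling the bandwidth doubles the decode ceiling and leaves the prefill estimate untouched, while doubling the compute does the reverse. Real throughput lands below both bounds.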

Sampling: how the next token is chosen

A forward pass does not output a word; it outputs a probability distribution over the whole vocabulary. A sampling step then picks the actual token, governed by parameters you control. Temperature scales how adventurous the choice is (low values make output focused and repeatable, high values make it varied); top-p and top-k restrict sampling to the most plausible candidates. The same model with the same prompt behaves very differently at temperature 0.2 versus 1.0 — worth remembering before blaming the model itself. For factual or technical work on local models, conservative sampling settings are usually the right default, and a fixed random seed makes runs reproducible when you are debugging a pipeline rather than chatting.
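The sampling step can be sketched with the standard library alone. This is one plausible implementation under stated assumptions, not the code of any particular runtime: the function name and the order in which the filters are applied (temperature, then top-k, then top-p) are choices made for illustration, and a temperature of 0 is treated as "always pick the most likely token".

```python
import math
import random
from typing import List, Optional


def sample_token(
    logits: List[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,   # must be >= 1 when given
    top_p: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token id from raw scores using temperature, top-k and top-p."""
    rng = rng or random.Random()
    if temperature <= 0:
        # Greedy: no randomness at all.
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature: divide scores before softmax; low values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidates from most to least likely.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)

    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for token in order:  # keep the smallest set whose probability mass reaches top_p
            kept.append(token)
            cumulative += probs[token]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalises the remaining weights for us.
    return rng.choices(order, weights=[probs[t] for t in order], k=1)[0]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.5, -1.0]
    print(sample_token(scores, temperature=0.2, top_p=0.9, rng=random.Random(42)))
```

Passing a seeded `random.Random` is what makes a run repeatable: the same scores, settings, and seed yield the same token.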

What this means for hardware you own

The practical recipe follows directly from the mechanics: weights plus KV cache (the stored attention keys and values for every token already in the context) must fit in fast memory, and memory bandwidth sets your generation ceiling. Quantized models shrink both the weight footprint and the bandwidth per token, which is why a well-chosen 4-bit model often gives the best real-world experience on consumer hardware. Batch size adds another wrinkle: each decode step reads the weights once and shares them across every request in the batch, so serving several requests at once costs little extra bandwidth per token, which is how one modest machine can serve a whole household's AI needs. A newer trick, speculative decoding, exploits the same asymmetry: a small draft model proposes several tokens cheaply and the large model verifies them in one parallel pass, keeping the ones it agrees with and discarding the rest. This trades spare compute for scarce bandwidth and, with the standard acceptance rule, speeds up generation without changing the output distribution.
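The "does it fit" check can be sketched as simple arithmetic. The formulas are the usual approximations and the numbers are assumptions: real quantized files carry some extra overhead for scaling factors, the model shape below is hypothetical, and the 10% headroom for the runtime and operating system is an arbitrary illustrative choice.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate weight footprint; ignores quantization metadata overhead."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumes a 16-bit cache
    batch_size: int = 1,
) -> int:
    """One key and one value vector per layer, per KV head, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value * batch_size


def fits_in_memory(memory_bytes: float, weights: float, kv_cache: float, headroom: float = 0.10) -> bool:
    """True if weights plus KV cache fit after reserving a fraction as headroom."""
    return weights + kv_cache <= memory_bytes * (1 - headroom)


if __name__ == "__main__":
    # Hypothetical 8-billion-parameter model: 32 layers, 8 KV heads, head size 128.
    cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
    for bits in (16, 4):
        weights = weight_bytes(8_000_000_000, bits)
        ok = fits_in_memory(8 * 2**30, weights, cache)
        print(f"{bits}-bit weights: {weights / 1e9:.0f} GB + {cache / 2**30:.0f} GiB cache, fits in 8 GiB: {ok}")
```

The cache term grows linearly with both context length and batch size, so a model that fits comfortably for one short conversation may not fit for several long ones.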

Local inference is the heart of self-hosting and air-gapped AI. For llama.cpp-based runtimes, models are typically distributed as GGUF files, a single-file format that commonly carries quantized weights for efficient on-device inference.
