Definition
Inference is the stage at which a trained machine-learning model is actually used: it receives an input (such as a text prompt) and produces an output (a prediction or generated response). It is the "execution" phase, as opposed to training, which is the one-time "learning" phase where the model's weights are built from massive datasets. For sovereign, self-hosted AI, inference is the part that runs on your own hardware.
Forward Pass vs. Backward Pass
During inference, data flows through the network's layers in a single direction, the forward pass, to compute an output. Training additionally runs a backward pass: the output is compared to a known answer, and the resulting error is propagated back through the layers to adjust the weights. Because inference skips the backward pass and the weight updates, it does not need to store gradients or optimizer state, so it requires far less memory and compute than training. This is why a model that took a data center weeks to train can often run on a laptop or, for smaller models, a phone.
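The difference between the two passes can be sketched with a toy one-weight model. This is a minimal illustration, not how any real framework is implemented: the names `forward` and `train_step`, the learning rate, and the single training example are all assumptions made for the sketch.

```python
# Toy model: y = w * x + b. All names and values here are illustrative.

def forward(w, b, x):
    """Forward pass: compute an output from the input and fixed weights."""
    return w * x + b

def train_step(w, b, x, target, lr=0.1):
    """One training step: forward pass, backward pass, weight update."""
    y = forward(w, b, x)
    error = y - target  # compare the output to the known answer
    # Backward pass: gradients of the loss 0.5 * error**2
    grad_w = error * x
    grad_b = error
    # Weight update: the step that inference never performs
    return w - lr * grad_w, b - lr * grad_b

# Training: repeated forward + backward passes change the weights.
w, b = 0.0, 0.0
for _ in range(200):
    w, b = train_step(w, b, x=2.0, target=5.0)

# Inference: a single forward pass with the weights left untouched.
prediction = forward(w, b, 2.0)
print(prediction)
```

Note that `forward` only reads the weights, while `train_step` also has to compute and hold gradients. At real model scale, that extra bookkeeping is a large part of training's memory cost.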
Autoregressive Generation
For large language models, inference is autoregressive: each token is produced by one forward pass, appended to the running context, and fed back in to predict the next token. Throughput is commonly measured in tokens per second.
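The loop described above can be sketched as follows. The "model" here is a hypothetical lookup table (`BIGRAMS`) that only looks at the last token, standing in for a real network that would read the whole context; `next_token`, `generate`, and the `</s>` stop token are likewise assumptions made for the sketch.

```python
import time

# Hypothetical stand-in for a trained model: maps the last token to the next.
BIGRAMS = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "down", "down": "</s>"}

def next_token(context):
    """Stands in for one forward pass: predict the next token."""
    return BIGRAMS.get(context[-1], "</s>")

def generate(prompt, max_new_tokens=16):
    """Autoregressive loop: predict, append, feed back in, repeat."""
    context = list(prompt)
    new_tokens = 0
    start = time.perf_counter()
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == "</s>":  # stop token ends generation
            break
        context.append(token)  # the output becomes part of the next input
        new_tokens += 1
    elapsed = time.perf_counter() - start
    tokens_per_second = new_tokens / elapsed if elapsed > 0 else float("inf")
    return context, tokens_per_second

tokens, tps = generate(["<s>", "the"])
print(tokens, f"{tps:.0f} tokens/s")
```

Because each new token requires its own pass through the model, generation time grows with output length, which is why tokens per second is the natural throughput metric.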
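Local inference is the heart of self-hosting and air-gapped AI. Models for local use are often distributed as GGUF files, the single-file format used by llama.cpp and tools built on it, which commonly holds quantized weights for efficient on-device inference.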
