Definition
The prefill phase is the first of two stages in large language model (LLM) inference. When you send a prompt, the model processes the input tokens in parallel rather than one at a time, computing the attention keys and values for the whole sequence and storing them in the key-value (KV) cache. Prefill ends when the first output token is produced. Because all prompt tokens are handled together, prefill is typically compute-bound rather than memory-bound: it keeps the GPU's matrix-multiply units busy. Its cost grows roughly linearly with prompt length for short and medium prompts, while the attention portion grows with the square of the prompt length and takes over once prompts get very long.
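To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the common approximation of 2 FLOPs per parameter per token for the dense layers, and counts the two attention matrix products (scores and weighted values) over a full T x T matrix, ignoring any savings from causal masking. The function name and the 7B-style model shape are illustrative assumptions, not measurements of any real model.

```python
def prefill_flops(prompt_tokens, n_params, n_layers, d_model):
    """Rough prefill cost, split into its linear and quadratic parts.

    Assumptions (approximations, not exact counts):
      - dense layers cost about 2 FLOPs per parameter per token
      - attention costs about 4 * T^2 * d_model FLOPs per layer
        (Q @ K^T plus scores @ V, no causal-mask savings)
    """
    linear = 2 * n_params * prompt_tokens
    attention = 4 * n_layers * d_model * prompt_tokens ** 2
    return linear, attention


# Hypothetical 7B-parameter model shape, chosen only for illustration.
N_PARAMS, N_LAYERS, D_MODEL = 7_000_000_000, 32, 4096

for tokens in (1_024, 8_192, 32_768):
    linear, attention = prefill_flops(tokens, N_PARAMS, N_LAYERS, D_MODEL)
    share = attention / (linear + attention)
    print(f"{tokens:>6} tokens: attention is {share:.0%} of estimated prefill FLOPs")
```

Under these assumptions the quadratic term is a small fraction of the work at around a thousand tokens and only becomes the larger share at tens of thousands of tokens.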
Why prefill dominates first-response time
Prefill is the work that happens before any text appears, so it largely determines how long a user waits for the model to start replying. A long user prompt, a lengthy system prompt, or a batch of retrieved documents all lengthen prefill, and a large context window allows more of that input to accumulate. On a self-hosted sovereign AI rig, prefill is where a big GPU earns its keep: the parallel arithmetic maps well to the hardware, much as a hashboard's parallel ASIC cores map well to SHA-256 work.
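As a rough illustration of how prompt length turns into waiting time, the sketch below divides the prompt size by a prefill throughput figure. It assumes throughput stays constant across prompt lengths, which is a simplification, and the function name, the 4,000 tokens-per-second rate, and the 50 ms overhead are all made-up values for the example.

```python
def estimate_ttft_seconds(prompt_tokens, prefill_tokens_per_second, overhead_seconds=0.0):
    """Estimate time to first token as fixed overhead plus prefill time.

    Assumes a constant prefill throughput, which real hardware only
    approximates; very long prompts usually run slower per token.
    """
    if prefill_tokens_per_second <= 0:
        raise ValueError("prefill_tokens_per_second must be positive")
    return overhead_seconds + prompt_tokens / prefill_tokens_per_second


# Hypothetical rig: 4,000 prompt tokens/s of prefill, 50 ms of fixed overhead.
for prompt_tokens in (500, 4_000, 32_000):
    ttft = estimate_ttft_seconds(prompt_tokens, 4_000.0, overhead_seconds=0.05)
    print(f"{prompt_tokens:>6} prompt tokens -> about {ttft:.2f} s before the first token")
```

Swapping in a measured prefill rate from your own benchmark gives a first-order estimate for a given prompt size.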
Prefill vs decode
Prefill differs from the token-by-token generation that follows. Once the cache is built, each new token needs only the cached keys and values plus one fresh computation for itself, so the decode stage does little arithmetic per step and is usually limited by memory bandwidth rather than compute. Optimizations like chunked prefill split a long prompt into pieces so the engine can interleave prefill work with ongoing generation, smoothing GPU utilization on a shared box.
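The idea behind chunked prefill can be sketched in a few lines of Python. This is a toy scheduler under simplifying assumptions, not the API or policy of any particular inference engine: the function names are hypothetical, each step carries exactly one prefill chunk, and every ongoing request decodes one token per step.

```python
def chunk_prompt(token_ids, chunk_size):
    """Split a prompt's token ids into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def plan_steps(prompt_len, chunk_size, decoding_requests):
    """Plan engine steps that mix one prefill chunk with ongoing decode work.

    Each step processes the next slice of the new prompt and also lets every
    request that is already generating emit one token, so those requests are
    not stalled while the long prompt is prefilled.
    """
    steps = []
    for chunk in chunk_prompt(list(range(prompt_len)), chunk_size):
        steps.append({"prefill_tokens": len(chunk), "decode_tokens": decoding_requests})
    return steps


# Hypothetical load: a 10-token prompt arrives while 3 requests are generating.
for step_number, step in enumerate(plan_steps(10, 4, 3), start=1):
    print(step_number, step)
```

Without chunking, the same prompt would occupy a single large step and the three generating requests would wait for it to finish.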
Understanding prefill helps you size hardware and read benchmarks honestly when running models on your own terms. See the decode phase for the second stage, and time to first token (TTFT) for the metric prefill drives.
Model your throughput in the inference cost calculator.
