Prefill Phase


The prefill phase is the first of two stages in large language model (LLM) inference. When you send a prompt, the model processes every input token at once, in parallel, computing the attention keys and values for the whole sequence and storing them in the key-value (KV) cache. The prefill phase ends when the first output token is produced. Because all prompt tokens are handled simultaneously, prefill is typically compute-bound rather than memory-bound: it saturates the GPU's matrix-multiply units, and its cost grows steeply with prompt length, because the attention computation scales roughly with the square of the sequence length.

Why prefill dominates first-response time

Prefill is the work that happens before any text appears, so it largely determines how long a user waits for the model to start replying, the metric formalized as time to first token (TTFT). Everything you put into the prompt lengthens prefill: a long system prompt, documents retrieved by a RAG pipeline, conversation history accumulating toward the context window limit. This is why a chatbot answers a short question almost instantly but pauses noticeably after you paste in a forty-page document: the pause is prefill. On a self-hosted rig, prefill is where a big GPU earns its keep: the massively parallel arithmetic maps onto the hardware much as a hashboard's parallel ASIC cores map onto SHA-256 work, and compute throughput translates directly into shorter waits.
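To put rough numbers on that pause, here is a minimal sketch that divides the prompt's token count by a prompt-processing speed. The token counts and the 1,000 tokens-per-second figure are hypothetical placeholders, not measurements, and the linear model ignores the extra attention cost of very long prompts.

```python
def estimate_ttft_seconds(prompt_parts, prefill_tokens_per_second):
    """Rough TTFT: total prompt tokens divided by prefill speed.

    Ignores tokenization, scheduling overhead, and the faster-than-linear
    growth of attention at long context lengths.
    """
    total_tokens = sum(prompt_parts.values())
    return total_tokens / prefill_tokens_per_second


# Hypothetical token counts for two requests.
short_question = {"system_prompt": 200, "question": 30}
pasted_document = {"system_prompt": 200, "question": 30, "document": 24_000}

# Hypothetical prompt-processing speed taken from a benchmark run.
PREFILL_SPEED = 1_000.0  # tokens per second

for name, parts in [("short", short_question), ("document", pasted_document)]:
    seconds = estimate_ttft_seconds(parts, PREFILL_SPEED)
    print(f"{name}: about {seconds:.2f} s before the first token")
```

Replacing the placeholder speed with the prompt-processing figure from your own benchmark gives a first-order estimate for your machine.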

Prefill vs decode

Prefill is fundamentally different from the decode phase, the token-by-token generation that follows. Once the KV cache is built, each new token needs only the cached keys and values plus one forward pass for the newest token. That step is light on arithmetic but heavy on memory traffic, since the model's weights must be read from memory for every single token. The two phases therefore stress opposite ends of the hardware: prefill wants raw compute, decode wants memory bandwidth. A machine can be excellent at one and mediocre at the other, which is why careful benchmarks quote prefill (prompt-processing) speed and generation speed separately, in tokens per second, rather than one blended number.
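The split can be sketched with two back-of-the-envelope ceilings. The sketch assumes the common approximation of about 2 floating-point operations per parameter per token for a forward pass, and assumes that single-user decode reads every weight once per generated token. The model size and hardware figures are hypothetical, and real throughput falls below both ceilings.

```python
def prefill_ceiling_tokens_per_s(params, peak_flops):
    """Compute-bound ceiling: about 2 FLOPs per parameter per prompt token.

    Ignores attention FLOPs and kernel inefficiency.
    """
    return peak_flops / (2.0 * params)


def decode_ceiling_tokens_per_s(params, bytes_per_param, mem_bandwidth):
    """Bandwidth-bound ceiling at batch size 1.

    Assumes every weight is read once per generated token and ignores
    KV-cache reads.
    """
    return mem_bandwidth / (params * bytes_per_param)


# Hypothetical model and GPU, chosen only to make the arithmetic easy.
PARAMS = 8e9            # 8 billion parameters
BYTES_PER_PARAM = 2.0   # 16-bit weights
PEAK_FLOPS = 100e12     # 100 TFLOP/s of matrix-multiply throughput
MEM_BANDWIDTH = 1e12    # 1 TB/s of memory bandwidth

prefill = prefill_ceiling_tokens_per_s(PARAMS, PEAK_FLOPS)
decode = decode_ceiling_tokens_per_s(PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)
print(f"prefill ceiling: {prefill:.0f} tokens/s")
print(f"decode ceiling:  {decode:.1f} tokens/s")
```

Under these assumptions the two ceilings differ by two orders of magnitude, which is why a single blended tokens-per-second figure hides more than it reveals.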

The square-law growth is also why context length is not free even when it fits in memory: doubling the prompt roughly quadruples the attention work in naive implementations, and although optimized kernels soften the curve, very long contexts still make the GPU work hardest precisely when the user is watching a blank screen.
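The growth can be checked with a small count of operations. This is a sketch for a naive, unmasked attention implementation: the formula counts the two n-by-n matrix products per layer, and the model dimensions are hypothetical.

```python
def attention_flops(seq_len, d_model, n_layers):
    """FLOPs for naive attention over a whole prompt.

    Per layer: Q @ K^T costs about 2 * n * n * d, and weights @ V costs
    another 2 * n * n * d. Causal masking would roughly halve this.
    """
    return 4 * seq_len * seq_len * d_model * n_layers


def linear_flops(seq_len, params):
    """FLOPs for the weight matrices: about 2 per parameter per token."""
    return 2 * seq_len * params


# Hypothetical model shape.
D_MODEL, N_LAYERS, PARAMS = 4096, 32, 8_000_000_000

for n in (4_096, 8_192, 16_384, 32_768):
    attn = attention_flops(n, D_MODEL, N_LAYERS)
    lin = linear_flops(n, PARAMS)
    share = attn / (attn + lin)
    print(f"{n:>6} tokens: attention is {share:.0%} of prefill FLOPs")

ratio = attention_flops(8_192, D_MODEL, N_LAYERS) / attention_flops(4_096, D_MODEL, N_LAYERS)
print("attention work after doubling the prompt:", ratio, "x")
```

The weight-matrix term only doubles when the prompt doubles, so attention takes a growing share of prefill as the context gets longer.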

What a self-hoster does with this

Understanding prefill turns several knobs from mysterious to obvious.

Trim the prompt: every token you do not send is prefill you do not pay for, so lean system prompts and tight RAG retrieval directly cut latency.

Reuse the cache: most local runtimes keep the KV cache across turns of a conversation, so only new tokens are prefilled, and prefix caching extends the same trick across requests that share an opening.

Serving stacks go further with chunked prefill, splitting a long prompt into pieces so the engine can interleave prefill work with ongoing generation for other users, smoothing GPU utilization on a shared box.

And when sizing hardware, match the machine to the workload: a document-heavy RAG assistant lives in prefill and rewards compute; a conversational chatbot lives in decode and rewards memory bandwidth. Quantization helps both phases but differently: smaller weights mean less memory traffic in decode, while prefill gains mostly when the arithmetic itself runs in a faster low-precision format.

Reading benchmarks with that split in mind is how you buy the right silicon instead of the best-marketed one. Running models on your own terms starts with knowing where the time actually goes.
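Cache reuse and chunked prefill can be illustrated with a toy planner. This is a sketch, not any particular runtime's API: the function names are hypothetical, tokens are plain integers, and real engines track far more state, for example cache blocks and per-request scheduling.

```python
def shared_prefix_length(cached_tokens, new_tokens):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached_tokens, new_tokens, chunk_size):
    """Return (reused_count, chunks) for a new prompt.

    Tokens covered by the cached prefix are skipped. The remainder is split
    into chunks so a scheduler could interleave them with decode steps.
    Assumption: at least one token is always processed, because the model
    needs a forward pass on the last token to produce next-token logits.
    """
    reused = shared_prefix_length(cached_tokens, new_tokens)
    reused = min(reused, max(len(new_tokens) - 1, 0))
    chunks = [
        new_tokens[i:i + chunk_size]
        for i in range(reused, len(new_tokens), chunk_size)
    ]
    return reused, chunks


# Hypothetical conversation: ten tokens already cached, five new ones arrive.
cached = list(range(10))
prompt = list(range(10)) + [100, 101, 102, 103, 104]

reused, chunks = plan_prefill(cached, prompt, chunk_size=2)
print("tokens reused from cache:", reused)
print("chunks still to prefill:", chunks)
```

Only the five new tokens are scheduled here, and they arrive in pieces small enough for an engine to slot other users' generation steps between them.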

