Chunked Prefill


Chunked prefill is a scheduling optimization used when serving large language models on a GPU. Inference has two phases with opposite hardware profiles: the prefill phase processes the entire input prompt in one shot and is compute-bound, while the decode phase generates one token at a time and is memory-bandwidth-bound. Running them naively means a single long prompt can monopolize the GPU and stall every in-flight generation, spiking inter-token latency for everyone else on the server.

How it works

Rather than processing a 16,000-token prompt as one giant forward pass, chunked prefill breaks it into fixed-size slices (512 tokens, for example) and feeds one slice per scheduling step. The spare capacity in each step is filled with decode requests that would otherwise be waiting their turn. This “piggybacking” of decodes onto partial prefills, introduced in the Sarathi work and adopted by serving engines such as vLLM, keeps the GPU's compute units and memory subsystem both busy on every step, instead of swinging between compute-heavy and memory-heavy work and saturating one resource while the other sits idle.
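The slicing and piggybacking described above can be sketched as a toy step planner. This is a minimal illustration, not the scheduler of any real engine: the `Request` record, the `build_step` function, and the decode-first ordering within a single token budget are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; the field names are illustrative only."""
    name: str
    prompt_tokens: int   # total prompt length
    prefilled: int = 0   # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        # A request starts decoding once its whole prompt has been prefilled.
        return self.prefilled >= self.prompt_tokens


def build_step(requests, token_budget=512):
    """Plan one scheduling step as a list of (request name, tokens) pairs.

    Assumed policy: every decoding request gets one token first, then the
    rest of the budget goes to prefill chunks.
    """
    plan = []
    remaining = token_budget
    # Snapshot the phases up front so a request that finishes its prefill
    # in this step does not also decode in the same step.
    decoding = [r for r in requests if r.in_decode]
    prefilling = [r for r in requests if not r.in_decode]
    for r in decoding:
        if remaining == 0:
            break
        plan.append((r.name, 1))
        remaining -= 1
    for r in prefilling:
        if remaining == 0:
            break
        chunk = min(remaining, r.prompt_tokens - r.prefilled)
        r.prefilled += chunk
        plan.append((r.name, chunk))
        remaining -= chunk
    return plan


if __name__ == "__main__":
    chat = Request("chat", prompt_tokens=40, prefilled=40)  # already decoding
    article = Request("article", prompt_tokens=16_000)      # long new prompt
    print(build_step([chat, article]))  # [('chat', 1), ('article', 511)]
```

With a 512-token budget and one active chat, the 16,000-token article is absorbed 511 tokens at a time over 32 steps, and the chat request receives a token in every one of them.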

The one knob and its trade-off

Chunked prefill is tuned through a single number: the maximum batched-token budget per step. Smaller chunks smooth token streaming for everyone, because no single prefill hogs a step, but they add scheduling overhead, stretch a long prompt's own prefill over more steps, and can lower peak throughput. Larger chunks push raw throughput up at the cost of occasional latency spikes when a big prefill lands. There is no accuracy cost either way: the mathematics of attention is unchanged and only the order in which the computation happens is rearranged, so the knob is purely a latency-versus-throughput dial for the operator to set.
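The trade-off can be made concrete with a deliberately crude cost model. Everything numeric here is an assumption chosen for illustration: the function `chunking_tradeoff`, the fixed per-step overhead of 5,000 microseconds, and the linear cost of 50 microseconds per token are invented, and a real GPU's per-chunk cost is not linear (later chunks attend over more cached context). The sketch only shows the direction in which the dial moves things.

```python
import math


def chunking_tradeoff(prompt_tokens, chunk_tokens,
                      per_token_us=50, step_overhead_us=5_000):
    """Toy model of one long prefill under a given chunk size.

    Returns (steps, worst_stall_us, total_prefill_us), where:
      steps            = scheduling steps needed to finish the prompt
      worst_stall_us   = longest single step, i.e. the longest wait that
                         any piggybacked decode sees between tokens
      total_prefill_us = time until the long prompt is fully prefilled
    The cost constants are made-up placeholders, not measurements.
    """
    steps = math.ceil(prompt_tokens / chunk_tokens)
    largest_chunk = min(chunk_tokens, prompt_tokens)
    worst_stall_us = step_overhead_us + largest_chunk * per_token_us
    total_prefill_us = steps * step_overhead_us + prompt_tokens * per_token_us
    return steps, worst_stall_us, total_prefill_us


if __name__ == "__main__":
    for chunk in (512, 2_048, 16_000):
        print(chunk, chunking_tradeoff(16_000, chunk))
```

Under these placeholder costs, a 512-token chunk caps the worst stall at 30,600 microseconds but takes 32 steps and 960,000 microseconds to finish the prompt, while a single 16,000-token pass finishes in 805,000 microseconds and stalls every other request for that entire time.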

Why it matters for self-hosters

For a sovereign operator running an inference rig on their own hardware, chunked prefill is the difference between a model that stays responsive under concurrent load and one that freezes the instant a user pastes a long document. On a home server with one or two accelerators shared among a household or a small team, that fairness is what makes local hosting feel like a real service instead of a demo. It matters most precisely where budgets are tight and every millisecond of a shared card counts, which is the typical situation for anyone hosting their own model rather than renting cloud capacity.

Where it fits

Chunked prefill is one of a family of scheduler techniques that let a single GPU behave like a fair, multi-user service rather than a first-come-first-served queue. It rarely works alone; it is designed to slot into a scheduler that is already juggling many concurrent requests.

A concrete way to feel the benefit is to picture two users on the same home server. One pastes a long article and asks for a summary; the other is midway through a quick back-and-forth chat. Without chunked prefill, the long article's prefill seizes the GPU for one enormous step and the chat user watches their reply freeze mid-sentence. With it, the article is digested a slice at a time and the chat user's tokens keep streaming in the gaps between those slices, so neither request starves the other. Multiply that across a household or a small team sharing one accelerator and the technique is what separates a local model that feels like a shared utility from one where a single heavy request makes everyone else wait. The knob that controls it is best set by watching real latency under your own typical mix of short and long prompts.

Chunked prefill complements continuous batching by giving the scheduler partial-prefill work to interleave with decodes. It builds on the KV cache, which each prefill chunk extends and every decode step reads from, and it pairs with prefix caching, which skips recomputing prompt prefixes that requests already share.

