Definition
Chunked prefill is a scheduling optimization used when serving large language models on a GPU. Inference has two phases with opposite hardware profiles: the prefill phase processes the entire input prompt in one shot and is compute-bound, while the decode phase generates one token at a time and is memory-bandwidth-bound. Running them naively means a single long prompt can monopolize the GPU and stall every in-flight generation, spiking inter-token latency for everyone else on the server.
How it works
Rather than processing a 16,000-token prompt as one giant forward pass, chunked prefill breaks it into fixed-size slices (for example, 512 tokens) and feeds one slice per scheduling step. The freed capacity in each step is filled with decode requests that would otherwise be waiting. This "piggybacking" of decodes onto partial prefills, introduced in the Sarathi work and adopted by serving engines such as vLLM, keeps the GPU's compute units and memory subsystem both busy instead of swinging between the two extremes.
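The slicing and piggybacking described above can be sketched as a toy scheduler. This is an illustrative model, not the code of Sarathi or vLLM: the names Request, schedule_step, and apply_plan are hypothetical, and the policy of admitting decodes first and giving the leftover budget to the prefill is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_left: int      # prompt tokens not yet prefilled
    generated: int = 0    # tokens decoded so far


def schedule_step(requests, token_budget):
    """Plan one scheduling step under a fixed per-step token budget."""
    plan = []
    budget = token_budget
    # Assumed policy: running decodes go first, one token each,
    # so a long prompt can never push them out of a step.
    for req in requests:
        if req.prompt_left == 0 and budget > 0:
            plan.append((req.name, "decode", 1))
            budget -= 1
    # Whatever budget remains is spent on prefill slices.
    for req in requests:
        if req.prompt_left > 0 and budget > 0:
            chunk = min(req.prompt_left, budget)
            plan.append((req.name, "prefill", chunk))
            budget -= chunk
    return plan


def apply_plan(requests, plan):
    """Update request state as if the planned step had been executed."""
    by_name = {r.name: r for r in requests}
    for name, kind, n in plan:
        if kind == "prefill":
            by_name[name].prompt_left -= n
        else:
            by_name[name].generated += n


# Two chats already generating, plus one 16,000-token prompt arriving.
reqs = [Request("chat-a", 0), Request("chat-b", 0), Request("long-doc", 16_000)]
steps = 0
while reqs[2].prompt_left > 0:
    apply_plan(reqs, schedule_step(reqs, token_budget=512))
    steps += 1
print(steps, reqs[0].generated, reqs[1].generated)
```

With a 512-token budget and two running decodes, each step carries a 510-token prefill slice, so the long prompt finishes in 32 steps and both chats receive a token in every one of those steps instead of waiting for the whole prompt.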
Why it matters for self-hosters
For a sovereign operator running an inference rig on their own hardware, chunked prefill is the difference between a model that feels responsive under concurrent load and one that freezes whenever a user pastes a long document. It is tuned mainly through one knob, the maximum number of tokens batched per scheduling step (max_num_batched_tokens in vLLM): a smaller budget means smaller prefill chunks and smoother token streaming for requests that are already generating, while a larger budget finishes long prompts in fewer steps and improves raw throughput. There is no accuracy cost because the math of attention is unchanged; only the order of computation is rearranged, so outputs match unchunked prefill up to floating-point rounding.
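To make the trade-off behind that knob concrete, here is a back-of-the-envelope sketch. It assumes every running decode consumes one token of the budget per step and the rest goes to a single prefill; prefill_steps is a hypothetical helper, and the model ignores how long each step takes, which grows with the budget.

```python
import math


def prefill_steps(prompt_tokens, token_budget, running_decodes):
    """Steps needed to prefill a prompt when each running decode takes 1 token per step."""
    chunk = token_budget - running_decodes  # budget left for the prefill slice
    if chunk <= 0:
        raise ValueError("token budget must exceed the number of running decodes")
    return math.ceil(prompt_tokens / chunk)


for budget in (256, 512, 2048):
    steps = prefill_steps(prompt_tokens=16_000, token_budget=budget, running_decodes=2)
    print(f"budget={budget:5d}  prefill chunk={budget - 2:5d}  steps={steps}")
```

Under these assumptions a 16,000-token prompt needs 63 steps at a budget of 256, 32 steps at 512, and 8 steps at 2048. Fewer, heavier steps get the long prompt to its first token sooner, but every decode sharing those steps waits longer between tokens.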
Chunked prefill works hand-in-hand with other serving optimizations. It complements continuous batching by giving the scheduler partial-prefill work to interleave with decodes, and it relies on the KV cache: each chunk appends its keys and values to the same cache that later chunks and every subsequent decode step read from.
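The role of the KV cache, and the claim that chunking leaves the attention math unchanged, can be checked on a toy example. The sketch below is a single-head causal attention in pure Python with random inputs; the function names are hypothetical, and it models only the arithmetic, not how a GPU kernel lays out the work.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def full_prefill(qs, ks, vs):
    """Causal attention over the whole prompt in one pass."""
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]


def chunked_prefill(qs, ks, vs, chunk_size):
    """The same computation in slices, each appending to a growing KV cache."""
    cache_k, cache_v, out = [], [], []
    for start in range(0, len(qs), chunk_size):
        end = min(start + chunk_size, len(qs))
        cache_k.extend(ks[start:end])
        cache_v.extend(vs[start:end])
        for i in range(start, end):
            # Token i sees everything cached by earlier chunks,
            # plus its own chunk up to and including position i.
            out.append(attend(qs[i], cache_k[: i + 1], cache_v[: i + 1]))
    return out


random.seed(0)
n_tokens, dim = 12, 4
qs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
ks = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
vs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

full = full_prefill(qs, ks, vs)
chunked = chunked_prefill(qs, ks, vs, chunk_size=5)
max_diff = max(abs(a - b) for ra, rb in zip(full, chunked) for a, b in zip(ra, rb))
print(max_diff)
```

Every token attends to exactly the same keys and values in both versions, so the outputs agree. On real hardware, chunked and unchunked kernels may sum in a different order, which is why agreement there is up to floating-point rounding rather than bit-for-bit.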
