Definition
Prefill-decode disaggregation is a serving architecture that runs the two phases of language-model inference on physically separate hardware. The prefill phase, which ingests the prompt, is compute-bound: it processes all prompt tokens in parallel as one burst of heavy matrix work. The decode phase, which emits tokens one at a time, is memory-bandwidth-bound: each step does little arithmetic but must reread the model weights and the stored attention state from memory. Packing both onto the same GPU forces them to compete for resources and degrade each other's latency.
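A rough roofline-style estimate makes the compute-bound versus memory-bound contrast concrete. The sketch below is not taken from any particular serving system. It assumes 16-bit weights, counts only weight traffic (ignoring attention-state and activation reads), and uses hypothetical accelerator figures chosen purely for illustration.

```python
def arithmetic_intensity(tokens_in_pass: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Each parameter costs about 2 FLOPs per token (multiply + add) and is
    read from memory once per pass, however many tokens share that pass.
    Assumption: KV-cache and activation traffic are ignored.
    """
    flops_per_param = 2 * tokens_in_pass
    return flops_per_param / bytes_per_param


def classify(tokens_in_pass: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare the pass against the hardware's ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(tokens_in_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

print(classify(2048, PEAK_FLOPS, MEM_BANDWIDTH))  # prefill over a 2048-token prompt
print(classify(1, PEAK_FLOPS, MEM_BANDWIDTH))     # one decode step for one sequence
```

Under these assumptions the prefill pass lands well above the ridge point and the single-token decode step far below it, which is the mismatch that motivates splitting the phases.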
Separating the phases
Disaggregation assigns prefill and decode to different GPU pools. Prompts are processed on prefill nodes chosen for raw compute throughput; the resulting attention state is transferred to decode nodes chosen for large, high-bandwidth memory, where token generation proceeds undisturbed. Each pool can be scaled, batched, and parallelized independently, and neither phase stalls the other. Research systems such as DistServe, Splitwise, and Mooncake demonstrated that this separation can substantially improve goodput, the rate of requests served within latency targets, compared to monolithic serving. The cost is that the intermediate state must cross the network between pools for every request; on fast interconnects this transfer is usually small relative to the generation time that follows it.
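Goodput, as defined above, can be computed from per-request latency measurements. The following is a minimal sketch under stated assumptions: the two latency targets used here, time to first token and time per output token, are a common choice but are not specified by this entry, and the `Completed` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Completed:
    """Hypothetical record of one finished request."""
    ttft_s: float  # time to first token, dominated by prefill
    tpot_s: float  # average time per output token, dominated by decode


def goodput(requests, window_s: float, ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Requests per second that met BOTH latency targets in the window."""
    within_targets = sum(
        1
        for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return within_targets / window_s


# Four requests finished in a 2-second window; one missed the first-token target.
sample = [
    Completed(ttft_s=0.20, tpot_s=0.03),
    Completed(ttft_s=0.35, tpot_s=0.04),
    Completed(ttft_s=0.90, tpot_s=0.03),  # too slow to first token
    Completed(ttft_s=0.25, tpot_s=0.05),
]
print(goodput(sample, window_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

Raw throughput would count all four requests; goodput counts only the three that met their targets, which is why it rewards architectures that keep the two phases from disturbing each other.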
Relevance at scale
Disaggregation mainly pays off in multi-node deployments with enough GPUs to dedicate a pool to each phase, so it is a hashcenter-scale, cluster-level technique rather than a single-card concern. For a sovereign operator scaling beyond one machine, it is the architectural pattern that lets prefill-heavy and decode-heavy traffic be provisioned and tuned separately instead of compromising on a single one-size-fits-all GPU configuration.
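Provisioning the two pools separately can be illustrated with a back-of-the-envelope sizing calculation. This is only a sketch: the per-GPU throughput figures and traffic numbers are hypothetical placeholders, and a real deployment would measure them and add headroom for bursts.

```python
import math


def size_pools(
    requests_per_s: float,
    avg_prompt_tokens: float,
    avg_output_tokens: float,
    prefill_tokens_per_s_per_gpu: float,
    decode_tokens_per_s_per_gpu: float,
) -> tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) needed to keep up with steady traffic.

    Each pool is sized only from the token load of its own phase.
    """
    prefill_load = requests_per_s * avg_prompt_tokens  # prompt tokens per second
    decode_load = requests_per_s * avg_output_tokens   # generated tokens per second
    prefill_gpus = math.ceil(prefill_load / prefill_tokens_per_s_per_gpu)
    decode_gpus = math.ceil(decode_load / decode_tokens_per_s_per_gpu)
    return prefill_gpus, decode_gpus


# Hypothetical workload: 10 requests/s, 2000-token prompts, 200-token answers.
# Hypothetical per-GPU rates: 8000 prefill tokens/s, 500 decode tokens/s.
print(size_pools(10, 2000, 200, 8000, 500))
```

If prompts grow longer while answers stay the same length, only the first number changes, so the operator can add prefill capacity without touching the decode pool.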
This separation builds on the same primitives as single-node serving: the transferred state is the KV cache, and each pool still runs continuous batching internally to keep its GPUs saturated.
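Because the transferred state is the KV cache, its size and transfer time can be estimated from the model shape. The sketch below assumes a standard transformer layout with one key tensor and one value tensor per layer; the model dimensions and the link bandwidth are hypothetical values, not measurements from any real system.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Size of the KV cache for one sequence of `seq_len` tokens."""
    # Factor of 2: each layer stores both a key and a value per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


def transfer_seconds(num_bytes: int, link_bytes_per_s: float) -> float:
    """Idealized transfer time; ignores latency and protocol overhead."""
    return num_bytes / link_bytes_per_s


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 4096-token prompt.
size = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=8, head_dim=128)
print(size / 2**20, "MiB")

# Hypothetical interconnect delivering 50e9 bytes/s between pools.
print(transfer_seconds(size, 50e9), "seconds")
```

With these placeholder numbers the cache is a few hundred mebibytes and crosses the link in roughly a hundredth of a second; on a much slower network the same transfer could become a meaningful share of request latency.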
