Prefill-Decode Disaggregation

Prefill-decode disaggregation is a serving architecture that runs the two phases of language-model inference on physically separate hardware. The prefill phase, which ingests the prompt, is compute-bound: it processes all prompt tokens in parallel in one burst of heavy matrix work. The decode phase, which emits tokens one at a time, is memory-bound: each step does little arithmetic but must reread the model weights and the stored attention state from memory. Packing both onto the same GPU forces them to contend for the same resources and degrade each other's latency.
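To make the compute-bound versus memory-bound distinction concrete, here is a rough roofline-style sketch, not a benchmark. It estimates arithmetic intensity (FLOPs per byte moved) for a single square weight projection. The model width, the accelerator's peak throughput, and its memory bandwidth are hypothetical values chosen only for illustration.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for one d_model x d_model projection over `tokens` tokens."""
    # One multiply and one add per weight, per token.
    flops = 2 * tokens * d_model * d_model
    # The weight matrix is read once, however many tokens are in the batch.
    weight_bytes = bytes_per_value * d_model * d_model
    # Input activations are read and output activations are written.
    activation_bytes = 2 * tokens * d_model * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare intensity with the ridge point (peak FLOP/s divided by bytes/s)."""
    ridge = peak_flops / mem_bandwidth
    return "compute" if intensity >= ridge else "memory"


# Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK_FLOPS = 300e12
MEM_BANDWIDTH = 2e12

prefill = arithmetic_intensity(tokens=2048, d_model=4096)  # whole prompt at once
decode = arithmetic_intensity(tokens=1, d_model=4096)      # one new token

print("prefill:", round(prefill, 1), bound_by(prefill, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", round(decode, 1), bound_by(decode, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions the prefill pass sits far above the ridge point and the single-token decode step far below it. Batching many decode requests together raises decode intensity, which is one reason each pool is tuned separately.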

Separating the phases

Disaggregation assigns prefill and decode to different GPU pools. Prompts are processed on prefill nodes chosen for raw compute throughput; the resulting attention state (the KV cache) is transferred to decode nodes chosen for large, high-bandwidth memory, where token generation proceeds without interference from incoming prompts. Each pool can be scaled, batched, and parallelized independently, and neither phase stalls the other. Research systems such as DistServe, Splitwise, and Mooncake demonstrated that this separation can substantially improve goodput, the rate of requests served within latency targets, compared to colocated (monolithic) serving. The cost is a network transfer of the intermediate state between pools, which is typically modest on fast interconnects.
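As an illustration of the goodput idea, the sketch below counts only requests that meet both latency targets. It assumes the targets are expressed as a time-to-first-token limit (driven mostly by prefill) and a time-per-output-token limit (driven mostly by decode). The `Completed` record, the thresholds, and the sample numbers are hypothetical and not taken from any of the systems named above.

```python
from dataclasses import dataclass


@dataclass
class Completed:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token, in seconds


def goodput(requests, window_s: float, ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Requests per second that met BOTH latency targets within the window."""
    within_targets = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return within_targets / window_s


# Hypothetical sample: four requests completed in a 2-second window.
reqs = [
    Completed(ttft_s=0.20, tpot_s=0.03),  # meets both targets
    Completed(ttft_s=0.90, tpot_s=0.03),  # slow first token
    Completed(ttft_s=0.30, tpot_s=0.08),  # slow generation
    Completed(ttft_s=0.25, tpot_s=0.04),  # meets both targets
]
window_s = 2.0

throughput = len(reqs) / window_s
good = goodput(reqs, window_s, ttft_slo_s=0.5, tpot_slo_s=0.05)
print(f"throughput={throughput} req/s, goodput={good} req/s")
```

Raw throughput counts all four requests, while goodput counts only the two that met both targets. That gap is what interference between the phases tends to widen.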

Relevance at scale

Disaggregation mainly pays off in multi-node deployments where there are enough GPUs to dedicate to each phase, so it is more a hashcenter-scale and cluster technique than a single-card concern. For a sovereign operator scaling beyond one machine, it is the architectural pattern that lets prefill-heavy and decode-heavy traffic be provisioned and tuned separately instead of compromising on a single one-size-fits-all GPU configuration.
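A back-of-the-envelope sizing sketch shows why separate provisioning matters. Every figure here (request rate, token counts, per-GPU throughput, target utilization) is a hypothetical placeholder; real values would have to be measured on the actual model and hardware.

```python
import math


def gpus_needed(requests_per_s: float, tokens_per_request: float,
                tokens_per_s_per_gpu: float, target_utilization: float = 0.7) -> int:
    """GPUs required to absorb a token demand while leaving some headroom."""
    demand = requests_per_s * tokens_per_request
    return math.ceil(demand / (tokens_per_s_per_gpu * target_utilization))


# Hypothetical per-GPU throughput for each phase.
PREFILL_TOKENS_PER_S = 20_000
DECODE_TOKENS_PER_S = 1_500


def size_pools(requests_per_s: float, prompt_tokens: float, output_tokens: float):
    prefill = gpus_needed(requests_per_s, prompt_tokens, PREFILL_TOKENS_PER_S)
    decode = gpus_needed(requests_per_s, output_tokens, DECODE_TOKENS_PER_S)
    return prefill, decode


# Long prompts with short answers (for example, summarization).
summarize = size_pools(requests_per_s=10, prompt_tokens=2000, output_tokens=300)
# Short prompts with long answers (for example, open-ended chat).
chat = size_pools(requests_per_s=10, prompt_tokens=200, output_tokens=1000)

print("summarization (prefill, decode):", summarize)
print("chat          (prefill, decode):", chat)
```

The two workloads arrive at the same request rate but call for very different pool ratios, which a single shared GPU configuration cannot express.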

This separation builds on the same primitives as single-node serving: the transferred state is the KV cache, and each pool still runs continuous batching internally to keep its GPUs saturated.
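The size of the transferred KV cache, and so the cost of moving it between pools, can be estimated with simple arithmetic. The sketch below assumes a hypothetical model (32 layers, 8 key-value heads of dimension 128, 16-bit values) and a hypothetical 50 GB/s link. It ignores protocol overhead, compression, and any overlap of transfer with computation.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence: a key and a value vector per layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def transfer_seconds(num_bytes: int, link_bytes_per_s: float) -> float:
    """Idealized transfer time at full link bandwidth."""
    return num_bytes / link_bytes_per_s


# Hypothetical model shape and prompt length.
size = kv_cache_bytes(tokens=2048, layers=32, kv_heads=8, head_dim=128)

# Hypothetical interconnects: a fast fabric and a slower network.
fast = transfer_seconds(size, link_bytes_per_s=50e9)
slow = transfer_seconds(size, link_bytes_per_s=1.25e9)  # roughly 10 Gbit/s

print(f"KV cache: {size / 2**20:.0f} MiB")
print(f"fast link: {fast * 1e3:.1f} ms, slow link: {slow * 1e3:.1f} ms")
```

With these assumed numbers the cache is a few hundred MiB per request, so the handoff takes milliseconds on the fast link and a noticeable fraction of a second on the slow one.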
