Definition
The decode phase is the second stage of large language model (LLM) inference, following prefill. Here the model produces output one token at a time. Each step feeds the most recently generated token back in, attends over the cached keys and values for every earlier position, runs the forward pass for that single new position, appends the new token's key and value vectors to the key-value (KV) cache, and samples the next token from the resulting probability distribution. This autoregressive loop repeats until the model emits a special end-of-sequence token or hits a length limit.
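The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `toy_step`, `EOS`, and `VOCAB_SIZE` are hypothetical names, the "forward pass" is a stand-in that just sums the cached token ids, and sampling is greedy (always pick the highest logit).

```python
from typing import List, Tuple

EOS = 0          # hypothetical end-of-sequence token id
VOCAB_SIZE = 8   # toy vocabulary size


def toy_step(token: int, kv_cache: List[Tuple[int, int]]) -> List[float]:
    """Stand-in for one forward pass over a single new position.

    A real model would compute attention over the cached keys and values;
    here we only mimic the data flow: append one entry, read the whole cache.
    """
    kv_cache.append((token, token))             # cache grows by one entry per step
    context_sum = sum(k for k, _ in kv_cache)   # "reads" every cached entry
    target = context_sum % VOCAB_SIZE
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


def decode(prompt: List[int], max_new_tokens: int) -> List[int]:
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    kv_cache: List[Tuple[int, int]] = []

    # Prefill: populate the cache from the prompt (a real system does this
    # in one parallel pass rather than token by token).
    logits: List[float] = []
    for token in prompt:
        logits = toy_step(token, kv_cache)

    # Decode: one token per iteration, until EOS or the length limit.
    generated: List[int] = []
    for _ in range(max_new_tokens):
        next_token = max(range(VOCAB_SIZE), key=lambda i: logits[i])  # greedy pick
        generated.append(next_token)
        if next_token == EOS:
            break
        logits = toy_step(next_token, kv_cache)
    return generated


print(decode([3, 2], max_new_tokens=10))  # → [5, 2, 4, 0]
```

Note that each decode iteration depends on the token chosen in the previous one, which is why the steps cannot be run in parallel the way prefill can.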
Why decode is memory-bound
Unlike prefill, which processes the whole prompt in parallel, decode does very little arithmetic per step but must stream the model weights and the entire growing KV cache from GPU memory for every token. That makes decode memory-bandwidth-bound: throughput is limited by how fast the hardware can move weights and cached state, not by raw compute. This is why a card with modest FLOPS but fast memory can still generate text quickly, and why the KV cache's size, which grows linearly with context length and batch size, often becomes the binding constraint on a self-hosted rig.
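A rough back-of-the-envelope estimate makes the bandwidth limit concrete. The sketch below assumes a standard transformer layout in which every layer stores one key and one value vector per KV head per token, and it counts the model weights as well, since they also have to be read on each step. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values), the 16 GB of weights, and the 1 TB/s memory bandwidth are hypothetical numbers chosen for illustration, and the result is an optimistic upper bound that ignores compute time, activations, and kernel overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def decode_tokens_per_second_bound(weight_bytes: float, kv_bytes: float,
                                   bandwidth_bytes_per_s: float,
                                   batch_size: int = 1) -> float:
    """Upper bound on decode throughput if each step is limited only by
    reading the weights once plus the whole KV cache once."""
    step_seconds = (weight_bytes + kv_bytes) / bandwidth_bytes_per_s
    return batch_size / step_seconds


# Hypothetical model and hardware numbers (assumptions, not measurements).
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                    seq_len=8192, batch_size=1)
bound = decode_tokens_per_second_bound(weight_bytes=16e9, kv_bytes=kv,
                                       bandwidth_bytes_per_s=1e12)

print(kv / 2**30, "GiB of KV cache")  # → 1.0 GiB of KV cache
print(f"at most about {bound:.0f} tokens/s")
```

Under these assumptions, doubling the context length or the batch size doubles the KV cache, while the weight traffic per step stays fixed. That is one reason batching several requests together raises aggregate throughput until the cache fills the available memory.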
Decode and perceived speed
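The time between successive tokens in the decode phase, often called inter-token latency, is what users feel as the model's "typing speed." Serving techniques such as continuous batching and PagedAttention exist largely to improve decode efficiency: continuous batching keeps the GPU busy by admitting new requests as soon as others finish, and PagedAttention reduces wasted KV cache memory by allocating it in fixed-size blocks. Both raise tokens-per-second across many concurrent requests.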
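To see why continuous batching helps, the following toy model counts decode steps for a queue of requests with different output lengths. It is a simplified sketch, not the scheduler of any real serving engine: the function names are hypothetical, every decode step is assumed to cost the same, and prefill is ignored.

```python
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest request finishes; finished
    requests leave their slot idle until then."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """A freed slot is refilled from the queue before the next step."""
    waiting = list(lengths)
    running: List[int] = []   # tokens still to generate per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.pop(0))
        steps += 1                                        # one decode step for all active requests
        running = [r - 1 for r in running if r - 1 > 0]   # drop finished requests
    return steps


# Four requests needing 8, 2, 2, and 2 output tokens, with two batch slots.
lengths = [8, 2, 2, 2]
print(static_batching_steps(lengths, slots=2))      # → 10
print(continuous_batching_steps(lengths, slots=2))  # → 8
```

In this example the short requests ride along in the slot next to the long one instead of waiting for it to finish, so the same work completes in fewer steps.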
For sovereign Bitcoiners running inference locally, the decode phase is where memory capacity and bandwidth decide how long a context you can hold and how fast you generate. See the prefill phase for the preceding stage and throughput vs latency for the trade-offs that shape it.
