Decode Phase


The decode phase is the second stage of large language model (LLM) inference, following prefill. Here the model produces output one token at a time. Each step feeds the most recently generated token back in, reads the cached attention keys and values for everything before it, computes a single forward step, appends the new token's keys and values to the key-value (KV) cache, and samples the next token. This autoregressive loop repeats until the model emits a special end-of-sequence token or hits a length limit. Prefill is a sprint over the whole prompt at once; decode is the long walk that follows, one step per generated token.
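The loop described above can be sketched in a few lines. This is a toy illustration only: `toy_step`, the `EOS` id, and the fake key/value tuples are hypothetical stand-ins for a real model's forward pass and sampler, chosen so the control flow is easy to follow.

```python
from typing import List, Tuple

EOS = 0  # hypothetical end-of-sequence token id

KVEntry = Tuple[int, int]  # stand-in for one token's (key, value) tensors


def toy_step(token: int, kv_cache: List[KVEntry]) -> Tuple[int, KVEntry]:
    """Stand-in for one forward step: reads the cache, returns (next_token, new KV entry)."""
    kv_entry = (token, token * 2)  # fake key/value for the token just fed in
    history_sum = sum(key for key, _ in kv_cache) + token  # "attend" over all history
    next_token = history_sum % 5  # deterministic stand-in for sampling
    return next_token, kv_entry


def decode(first_token: int, kv_cache: List[KVEntry], max_new_tokens: int) -> List[int]:
    """Autoregressive loop: one step per token until EOS or the length limit."""
    generated: List[int] = []
    token = first_token
    for _ in range(max_new_tokens):
        next_token, kv_entry = toy_step(token, kv_cache)
        kv_cache.append(kv_entry)  # the cache grows by one entry per step
        if next_token == EOS:
            break
        generated.append(next_token)
        token = next_token  # feed the new token back in
    return generated


# The cache starts with the entries a prefill pass would have produced.
cache: List[KVEntry] = [(3, 6), (4, 8)]
tokens = decode(first_token=2, kv_cache=cache, max_new_tokens=4)
```

Each iteration depends on the token produced by the previous one, which is why the steps cannot be run in parallel the way prefill can.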

Why decode is memory-bound

Unlike prefill, which processes every prompt token in parallel and saturates the GPU's arithmetic units, decode does very little math per step (one token's worth) but must stream the model's weights and the entire growing KV cache from memory on every single token. That makes decode memory-bandwidth-bound: throughput is limited by how fast the hardware can move bytes, not by raw compute. The practical consequences are counterintuitive until you internalize them. A card with modest FLOPS but fast memory can generate text quickly; a compute monster with slow memory cannot. Rough single-user generation speed is approximately memory bandwidth divided by the bytes that must be read per token, which is why quantization speeds up decode almost linearly. A 4-bit GGUF model moves roughly a quarter of the weight bytes of its 16-bit original (slightly more in practice, because the format also stores per-block scales), so the same memory bus delivers up to about four times the tokens.
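The bandwidth rule of thumb is easy to turn into arithmetic. The sketch below assumes an idealized upper bound: every weight is read exactly once per token and nothing else limits speed. The function name and the example hardware figures (1000 GB/s, an 8-billion-parameter model) are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second(
    bandwidth_gb_s: float,
    params_billion: float,
    bits_per_weight: float,
    kv_cache_gb: float = 0.0,
) -> float:
    """Upper-bound decode speed: memory bandwidth / bytes read per token."""
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params * bytes each = GB
    gb_per_token = weight_gb + kv_cache_gb  # weights plus the cache read on each step
    return bandwidth_gb_s / gb_per_token


# Hypothetical card with 1000 GB/s of memory bandwidth, 8B-parameter model.
fp16_speed = decode_tokens_per_second(1000, 8, 16)  # 16 GB of weights per token
q4_speed = decode_tokens_per_second(1000, 8, 4)     # 4 GB of weights per token
speedup = q4_speed / fp16_speed                      # 4.0 under these idealized assumptions
```

Real throughput lands below this bound, and the KV cache term grows with context length, so the gain from weight quantization shrinks as the cache becomes a larger share of the bytes read.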

The KV cache is the binding constraint

The KV cache — the stored attention keys and values for every token processed so far — is what makes decode affordable at all: without it, each new token would require recomputing attention over the entire history. But the cache grows linearly with context length and with the number of concurrent requests, and it lives in the same VRAM as the weights. On a self-hosted rig this is the budget line that decides how long a context window you can actually hold and how many simultaneous sessions you can serve. Serving stacks attack the problem from several angles: continuous batching keeps the GPU busy by interleaving many requests' decode steps, PagedAttention allocates cache memory in pages to eliminate fragmentation, and KV-cache quantization shrinks the cache itself.
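To see how quickly the cache eats VRAM, the standard sizing formula can be written out directly: two tensors (keys and values) per layer, per KV head, per token. The model dimensions below are illustrative assumptions rather than the specs of any particular model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    n_sequences: int = 1,
    bytes_per_element: int = 2,  # 2 bytes for a 16-bit cache
) -> int:
    """KV cache size: keys and values for every layer, head, token, and sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_len * n_sequences


GIB = 2**30

# Hypothetical 32-layer model with 8 KV heads of dimension 128, 8192-token context.
one_session = kv_cache_bytes(32, 8, 128, 8192) / GIB                   # 1.0 GiB
four_sessions = kv_cache_bytes(32, 8, 128, 8192, n_sequences=4) / GIB  # 4.0 GiB
eight_bit = kv_cache_bytes(32, 8, 128, 8192, bytes_per_element=1) / GIB  # 0.5 GiB
```

The linear growth in both context length and session count is visible in the formula, as is the halving that an 8-bit cache buys.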

Decode and perceived speed

The time between successive tokens in the decode phase is what users feel as the model's "typing speed," while prefill determines the pause before the first word appears. For a single-user sovereign setup, such as one person chatting with a llama.cpp or Ollama instance, decode throughput in the range of human reading speed is comfortable, and the tuning levers that matter are quantization level, context length, and how much of the model fits in fast memory versus spilling to system RAM. Spillover is the classic self-hosting failure mode: a model that almost fits in VRAM decodes at a crawl because, on every token, the part that did not fit must be read from much slower system RAM or dragged across the PCIe bus.
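The cost of spillover follows from the same bytes-over-bandwidth reasoning. The sketch below is a simplified model that assumes the fast and slow portions are read one after the other on every token; the function name and all bandwidth and size figures are hypothetical.

```python
def tokens_per_second_with_spill(
    model_gb: float,
    vram_gb: float,
    fast_bandwidth_gb_s: float,
    slow_bandwidth_gb_s: float,
) -> float:
    """Estimate decode speed when part of the model does not fit in fast memory."""
    in_vram = min(model_gb, vram_gb)
    spilled = model_gb - in_vram  # portion served from the slow path
    seconds_per_token = in_vram / fast_bandwidth_gb_s + spilled / slow_bandwidth_gb_s
    return 1.0 / seconds_per_token


# Hypothetical: 20 GB model, 1000 GB/s VRAM, 25 GB/s slow path.
fits = tokens_per_second_with_spill(20, 24, 1000, 25)    # everything in VRAM
spills = tokens_per_second_with_spill(20, 16, 1000, 25)  # 4 GB overflows
```

Under these assumptions, moving a fifth of the model to the slow path costs roughly a ninefold drop in speed, because the slow portion dominates the time per token.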

Accelerating decode

Because the bottleneck is bytes, most decode accelerations attack memory traffic. Speculative decoding uses a small draft model to propose several tokens cheaply, then has the large model verify the batch in a single parallel pass — accepted tokens cost one big-model step instead of several, converting some of decode's serial walk back into parallel work. Architectural changes help too: grouped-query attention shares key-value heads to shrink the KV cache, and sliding-window schemes bound how much history each layer must read. None of this changes the fundamentals — faster memory, fewer bytes per token, or cleverer reuse are the only three doors out.
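Speculative decoding can be illustrated with a toy greedy variant, in which a drafted token is accepted only if it matches what the large model would have picked. This is a sketch under simplifying assumptions: `draft_next` and `target_next` are hypothetical stand-ins for the two models, and the verification loop stands in for what a real system does in one batched forward pass. Sampling-based implementations use a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable, List

NextFn = Callable[[List[int]], int]  # maps a token sequence to the next token


def speculative_step(seq: List[int], draft_next: NextFn, target_next: NextFn, k: int) -> List[int]:
    """Return the tokens gained from one draft-then-verify round (greedy variant)."""
    # 1. The small model proposes k tokens, one after another (cheap).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(seq + proposal))

    # 2. The large model checks every position. A real implementation scores
    #    all k + 1 positions in a single parallel pass; the loop emulates that.
    accepted: List[int] = []
    for i in range(k):
        expected = target_next(seq + proposal[:i])
        if proposal[i] != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(proposal[i])
    accepted.append(target_next(seq + proposal))  # all guesses held: bonus token
    return accepted


# Toy models: the target counts upward mod 10; the draft agrees except after a 3.
def target(s: List[int]) -> int:
    return (s[-1] + 1) % 10


def draft(s: List[int]) -> int:
    return 0 if s[-1] == 3 else (s[-1] + 1) % 10


gained = speculative_step([1], draft, target, k=4)  # [2, 3, 4]
```

Here one verification round yields three tokens, and the output is identical to what the large model would have generated on its own, which is the property that makes the greedy variant lossless.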

For sovereign Bitcoiners running inference locally, the decode phase is where memory capacity and bandwidth decide what your hardware is really worth. See the prefill phase for the preceding stage and throughput vs latency for the serving trade-offs that shape both.

