Definition
Throughput-optimized serving is a way of configuring a large language model inference server to maximize the total number of tokens produced per second across every concurrent user, rather than minimizing the response time of any single request. It sits at one end of the fundamental throughput-versus-latency trade-off that governs all LLM serving.
Why the trade-off exists
Generating text has two phases with opposite characteristics. Prefill processes the whole prompt in one pass and is typically compute-bound. Decode emits one token per request per step and is bound by memory bandwidth, because each step must re-read the model weights; the GPU's compute units sit underused unless many requests are batched together. Batching more requests raises throughput because each weight load is shared across the batch, but it also lengthens each step, so individual users wait longer between tokens. Larger batches mean higher throughput and higher latency; smaller batches mean the reverse.
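The trade-off can be illustrated with a toy cost model. This is a sketch rather than a measurement: it assumes each decode step pays a fixed cost to read the weights plus a small cost per batched request. The function names and the two constants are hypothetical values chosen only to show the shape of the curve.

```python
def decode_step_time(batch_size, weight_load_s=0.020, per_request_s=0.0005):
    """Toy model of one decode step, in seconds.

    weight_load_s: assumed fixed cost of reading the weights once per step.
    per_request_s: assumed extra cost for each request in the batch.
    Both constants are illustrative, not measured.
    """
    return weight_load_s + per_request_s * batch_size


def throughput_tokens_per_s(batch_size, **kwargs):
    # Each step emits one token for every request in the batch.
    return batch_size / decode_step_time(batch_size, **kwargs)


if __name__ == "__main__":
    for batch_size in (1, 8, 32, 128):
        step = decode_step_time(batch_size)
        tput = throughput_tokens_per_s(batch_size)
        print(f"batch={batch_size:4d}  time per token={step * 1000:6.1f} ms  "
              f"throughput={tput:8.1f} tok/s")
```

Under these assumptions both columns grow with batch size: total tokens per second rises because the fixed weight-load cost is shared, while each user's time between tokens also rises because every step takes longer.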
Goodput, not raw throughput
A purely throughput-maximizing server can violate users' latency expectations. The metric that matters in practice is goodput: the highest request rate the server can sustain while still meeting its service-level objectives on time to first token and time per output token. Techniques like chunked prefill, dynamic batch sizing, and disaggregating prefill from decode let an operator push throughput up without blowing past those latency targets.
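As a minimal sketch of the goodput definition above, the helper below picks the highest tested request rate whose measured latencies still meet both objectives. The `LoadPoint` type, the choice of 99th-percentile latencies, and every number in `measurements` are assumptions made up for illustration; they are not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class LoadPoint:
    request_rate: float  # offered load, in requests per second
    p99_ttft_s: float    # time to first token at that load (assumed p99)
    p99_tpot_s: float    # time per output token at that load (assumed p99)


def goodput(points, ttft_slo_s, tpot_slo_s):
    """Highest tested request rate that meets both SLOs, or 0.0 if none does."""
    passing = [
        p.request_rate
        for p in points
        if p.p99_ttft_s <= ttft_slo_s and p.p99_tpot_s <= tpot_slo_s
    ]
    return max(passing, default=0.0)


# Hypothetical load-test results, invented for this example.
measurements = [
    LoadPoint(request_rate=2.0, p99_ttft_s=0.20, p99_tpot_s=0.030),
    LoadPoint(request_rate=4.0, p99_ttft_s=0.35, p99_tpot_s=0.045),
    LoadPoint(request_rate=8.0, p99_ttft_s=0.90, p99_tpot_s=0.080),
]

if __name__ == "__main__":
    print(goodput(measurements, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

With these invented numbers the server can accept 8 requests per second, but only the 4 requests per second point stays within both targets, so that lower rate is the goodput.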
For a sovereign operator with one or a few GPUs, choosing a throughput-optimized profile makes sense for batch jobs, document processing, and offline pipelines, while interactive chat favours a latency-leaning configuration. The right setting is workload-dependent. See in-flight batching and request scheduling for the mechanisms that implement this balance.
