Throughput-Optimized Serving

Sovereign AI

Definition

Throughput-optimized serving is a way of configuring a large language model inference server to maximize the total number of tokens produced per second across every concurrent user, rather than minimizing the response time of any single request. It sits at one end of the fundamental throughput-versus-latency trade-off that governs all LLM serving.

Why the trade-off exists

Generating text has two phases with opposite characteristics. Prefill processes the whole prompt in a single parallel pass and is compute-bound. Decode emits one token per step and is bound by memory bandwidth: every step has to read the full model weights, so the GPU's compute units sit largely idle unless many requests are batched together. Batching more requests raises throughput because one weight load serves every request in the batch, but it also lengthens each step, so individual users wait longer between tokens. Larger batches mean higher throughput and higher latency; smaller batches mean the reverse.
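The batching trade-off can be sketched with a toy cost model. This is a minimal illustration, not a measurement: the function names and the constants (20 ms to stream the weights, 1 ms of extra work per sequence) are assumptions chosen only to show the shape of the curve.

```python
# Toy model of one decode step. All constants are hypothetical.
WEIGHT_LOAD_MS = 20  # assumed fixed cost per step: streaming weights from GPU memory
PER_SEQ_MS = 1       # assumed extra cost per sequence in the batch


def decode_step_ms(batch_size):
    """Time for one decode step, which emits one token per sequence."""
    return WEIGHT_LOAD_MS + PER_SEQ_MS * batch_size


def throughput_tokens_per_s(batch_size):
    """Total tokens per second across the whole batch."""
    return batch_size / (decode_step_ms(batch_size) / 1000)


if __name__ == "__main__":
    for batch_size in (1, 8, 64):
        print(
            batch_size,
            decode_step_ms(batch_size),  # what each user waits between tokens
            round(throughput_tokens_per_s(batch_size), 1),
        )
```

Under these assumed constants, both columns grow with batch size: the fixed weight-load cost is spread over more sequences, so total throughput rises, while the step time that each user experiences between tokens rises too.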

Goodput, not raw throughput

A purely throughput-maximizing server can violate users' latency expectations. The metric that matters in practice is goodput: the highest request rate the server can sustain while still meeting its service-level objectives on time to first token and time per output token. Techniques like chunked prefill, dynamic batch sizing, and disaggregating prefill from decode let an operator push throughput up without blowing past those latency targets.
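As a rough sketch of how goodput might be estimated from load-test data: run the server at several request rates, record time to first token and time per output token for each request, and take the highest rate at which enough requests meet both objectives. The `RequestStats` type, the function names, the sample numbers, and the 90% attainment target are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token, in seconds


def slo_attainment(requests, ttft_slo_s, tpot_slo_s):
    """Fraction of requests that meet both latency objectives."""
    if not requests:
        return 0.0
    ok = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return ok / len(requests)


def goodput(runs, ttft_slo_s, tpot_slo_s, target=0.9):
    """Highest tested request rate whose attainment reaches the target.

    `runs` maps an offered load (requests per second) to the stats
    collected at that load. The 0.9 target is an assumed threshold.
    """
    passing = [
        rate for rate, reqs in runs.items()
        if slo_attainment(reqs, ttft_slo_s, tpot_slo_s) >= target
    ]
    return max(passing, default=0.0)


# Hypothetical load-test results at 1, 2 and 4 requests per second.
example_runs = {
    1.0: [RequestStats(0.2, 0.03)] * 10,
    2.0: [RequestStats(0.4, 0.04)] * 10,
    4.0: [RequestStats(0.4, 0.04)] * 5 + [RequestStats(1.5, 0.09)] * 5,
}

if __name__ == "__main__":
    print(goodput(example_runs, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

In this made-up data the server completes more requests at 4 requests per second, but half of them miss the objectives, so the goodput is the 2 requests per second level rather than the raw maximum.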

For a sovereign operator with one or a few GPUs, choosing a throughput-optimized profile makes sense for batch jobs, document processing, and offline pipelines, while interactive chat favours a latency-leaning configuration. The right setting is workload-dependent. See in-flight batching and request scheduling for the mechanisms that implement this balance.
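One way to picture the workload dependence is to derive a batch-size cap from the latency target of each workload. The sketch below assumes a toy cost model (a fixed 20 ms per decode step plus 1 ms per sequence), a hard cap of 256 sequences, and two hypothetical profiles; none of these names or numbers correspond to settings of a real inference server.

```python
# Toy cost model and limits. All values are hypothetical.
WEIGHT_LOAD_MS = 20  # assumed fixed cost per decode step
PER_SEQ_MS = 1       # assumed extra cost per batched sequence
HARD_CAP = 256       # assumed memory-imposed limit on concurrent sequences

# Assumed per-token latency targets for two kinds of workload.
PROFILES = {
    "interactive_chat": {"tpot_slo_ms": 50},
    "offline_batch": {"tpot_slo_ms": 500},
}


def max_batch_within_slo(tpot_slo_ms):
    """Largest batch size whose modelled step time stays within the target."""
    best = 0
    for batch_size in range(1, HARD_CAP + 1):
        step_ms = WEIGHT_LOAD_MS + PER_SEQ_MS * batch_size
        if step_ms <= tpot_slo_ms:
            best = batch_size
    return best


if __name__ == "__main__":
    for name, profile in PROFILES.items():
        print(name, max_batch_within_slo(profile["tpot_slo_ms"]))
```

With these assumptions the interactive profile is limited by its latency target, while the offline profile runs into the hard cap first, which is the throughput-optimized end of the trade-off.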
