Definition
Throughput versus latency is the central trade-off when serving large language models (LLMs). Latency measures how fast one request is served — time to first token (TTFT) for the first response, and inter-token latency for each token after. Throughput measures total work across the whole system, usually as tokens per second summed over every concurrent request. The two pull in opposite directions: choices that maximize aggregate throughput often make any individual request feel slower, and vice versa.
Why they conflict
Packing more requests into a batch keeps the GPU busy and raises throughput, but each request now shares compute and memory bandwidth, so its own latency rises. The decode phase is memory-bound, so a single request rarely saturates a big GPU on its own — batching is how you reclaim that idle capacity. Continuous batching narrows the gap by swapping finished requests out and new ones in, lifting throughput with a smaller latency penalty than naive static batching.
Choosing a target
Interactive chat is latency-sensitive: a low TTFT and snappy token stream matter more than raw volume. Offline or bulk jobs — summarizing an archive, classifying a corpus — are throughput-sensitive, where total tokens per hour is what counts and per-request delay is irrelevant. Knowing which side you are on tells you how to size hardware and configure batch size.
For sovereign Bitcoiners self-hosting inference, this trade-off decides whether you tune your rig for a responsive personal assistant or a high-volume batch pipeline. See time to first token (TTFT) and batch size (inference).
Balance the two in the inference cost calculator.
In Simple Terms
Throughput versus latency is the central trade-off when serving large language models (LLMs). Latency measures how fast one request is served — time to first…
