Definition
Goodput is the performance metric that tells a large language model operator how much useful work their server can actually deliver. It is defined as the maximum request rate the server can sustain while still meeting its service-level objectives (SLOs) on latency. Raw throughput counts every token the GPU produces; goodput counts only the requests that complete within their promised latency budget. A server can post impressive throughput numbers while delivering poor goodput if those tokens come too slowly for users to tolerate.
Why goodput beats throughput
The two latencies that define an SLO are time to first token (TTFT), how long a user waits before output begins, and time per output token (TPOT), the streaming pace thereafter. When an operator pushes batch sizes up to maximize throughput, both latencies rise, and at some point requests start missing their SLO. Those requests still consume GPU cycles but no longer count as goodput, because a too-slow answer is, for an interactive user, effectively a failed one. Goodput captures exactly this distinction between busy and productive.
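The distinction between busy and productive can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the RequestTrace record, the helper names, the SLO values, and the three sample requests are all made up for the example, and TPOT is assumed to be the average gap between tokens after the first one.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (illustrative only)."""
    ttft_s: float        # time to first token, in seconds
    total_s: float       # arrival to last token, in seconds
    output_tokens: int   # number of tokens generated

    @property
    def tpot_s(self) -> float:
        # Average streaming pace after the first token.
        if self.output_tokens <= 1:
            return 0.0
        return (self.total_s - self.ttft_s) / (self.output_tokens - 1)


def meets_slo(r: RequestTrace, ttft_slo_s: float, tpot_slo_s: float) -> bool:
    # A request counts as "good" only if it meets BOTH latency targets.
    return r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s


def summarize(traces, window_s, ttft_slo_s, tpot_slo_s):
    good = [r for r in traces if meets_slo(r, ttft_slo_s, tpot_slo_s)]
    return {
        # Throughput counts every token, on time or not.
        "throughput_tok_per_s": sum(r.output_tokens for r in traces) / window_s,
        "completed_req_per_s": len(traces) / window_s,
        # Only requests inside their latency budget count here.
        "good_req_per_s": len(good) / window_s,
        "slo_attainment": len(good) / len(traces) if traces else 0.0,
    }


# Made-up sample: three requests observed over a 10-second window.
traces = [
    RequestTrace(ttft_s=0.2, total_s=2.2, output_tokens=101),  # meets both targets
    RequestTrace(ttft_s=0.3, total_s=5.3, output_tokens=101),  # streams too slowly
    RequestTrace(ttft_s=1.5, total_s=3.5, output_tokens=101),  # first token too late
]
summary = summarize(traces, window_s=10.0, ttft_slo_s=0.5, tpot_slo_s=0.04)
print(summary)
```

In this sample all three requests consume GPU time and add to token throughput, but only one of them stays within both latency targets, so the rate of good requests is a third of the completion rate.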
Tuning for goodput
Optimizing goodput means finding the operating point where the server is as loaded as possible without breaking its latency promise. Techniques that raise the ceiling include chunked prefill, prefill-decode disaggregation, dynamic batch sizing, and priority-aware scheduling that protects deadline-sensitive requests. The right target SLO is a business decision: a sovereign operator running a personal assistant will typically need tighter latency goals than one running an overnight batch pipeline.
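One way to locate that operating point is to load-test the server at increasing request rates and keep the highest rate at which enough requests still meet the SLO. The sketch below assumes such a sweep has already been run; the function name, the 90% attainment target, and the sweep numbers are illustrative assumptions, not measurements or a prescribed threshold.

```python
def max_sustainable_rate(sweep, target_attainment=0.9):
    """Return the highest tested request rate that still meets the target.

    sweep: iterable of (request_rate_per_s, slo_attainment) pairs taken
    from load tests. Rates are scanned in ascending order and the scan
    stops at the first rate that falls below the target, on the
    assumption that higher load only makes latency worse.
    """
    best = 0.0
    for rate, attainment in sorted(sweep):
        if attainment < target_attainment:
            break
        best = rate
    return best


# Made-up sweep results: (requests per second, fraction meeting the SLO).
sweep = [(1.0, 1.00), (2.0, 0.98), (4.0, 0.93), (6.0, 0.71), (8.0, 0.40)]
goodput = max_sustainable_rate(sweep, target_attainment=0.9)
print(goodput)
```

With these sample numbers the server keeps at least 90% of requests within budget up to 4 requests per second, so that rate is the goodput estimate. A finer sweep between 4 and 6 would narrow it further.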
Goodput is the metric a self-hoster should watch when sizing hardware, because it reflects real user experience rather than benchmark peaks. See throughput-optimized serving for the trade-off it measures and request scheduling for the policies that defend it.
