Definition
Tokens per second (TPS) is the primary metric for how fast a large language model produces text. It is simply the number of tokens generated divided by the generation time. Since a token is roughly three-quarters of an English word, TPS tells you, in practical terms, how quickly answers appear on screen, and whether a given model-and-hardware combination feels snappy or sluggish.
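The definition above can be sketched as a small calculation. This is a minimal illustration, not a benchmark harness: the function names are hypothetical, and the 0.75 words-per-token figure is the rough English-language approximation mentioned above, which varies by tokenizer and text.

```python
def tokens_per_second(tokens_generated: int, generation_seconds: float) -> float:
    """TPS = tokens generated / time spent generating them."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return tokens_generated / generation_seconds


def approx_words_per_second(tps: float, words_per_token: float = 0.75) -> float:
    """Convert TPS to an approximate English word rate (assumed ratio)."""
    return tps * words_per_token


# Example: 300 tokens produced in 10 seconds.
tps = tokens_per_second(300, 10.0)
words = approx_words_per_second(tps)
```

Here `tps` is 30.0 and `words` is 22.5, so a 30 TPS setup produces roughly 22 English words every second.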
Throughput vs. Latency
TPS measures throughput, the total output rate, but two related latency metrics matter just as much for interactive use. Time To First Token (TTFT) is the delay before the first token appears, and it is dominated by processing your prompt. Time Per Output Token (TPOT), the decode time for each subsequent token, governs how smoothly text streams afterward; for a single stream, the decode rate is roughly 1 / TPOT. As a rule of thumb, around 6 tokens per second (about 4.5 words per second at three-quarters of a word per token) matches a typical human reading speed, so anything above that feels comfortably real-time for a single user.
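One way to separate these metrics is to record a timestamp for each streamed token and derive TTFT, TPOT, and the two flavors of TPS from them. The sketch below assumes you already have such timestamps (in seconds) from whatever client you use; `LatencyReport` and `latency_report` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class LatencyReport:
    ttft: float            # seconds until the first token arrived
    tpot: float            # average seconds per token after the first
    decode_tps: float      # tokens per second during decoding only
    end_to_end_tps: float  # tokens per second including the TTFT wait


def latency_report(request_start: float, token_times: list[float]) -> LatencyReport:
    """Derive latency metrics from per-token arrival timestamps."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    ttft = token_times[0] - request_start
    decode_seconds = token_times[-1] - token_times[0]
    total_seconds = token_times[-1] - request_start
    if decode_seconds <= 0 or total_seconds <= 0:
        raise ValueError("timestamps must be increasing")
    # The first token is attributed to TTFT, so decoding covers n - 1 tokens.
    tpot = decode_seconds / (len(token_times) - 1)
    return LatencyReport(
        ttft=ttft,
        tpot=tpot,
        decode_tps=1.0 / tpot,
        end_to_end_tps=len(token_times) / total_seconds,
    )


# Example: request sent at t=0, five tokens arriving 0.25 s apart after a 0.5 s wait.
report = latency_report(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
```

In this example the decode rate is 4 tokens per second, but the end-to-end figure is lower because the half-second TTFT is included. Reporting both avoids hiding a slow prompt phase behind a fast decode rate.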
What Limits It Locally
On your own hardware, generation speed is usually bound by memory bandwidth rather than raw compute, because each new token requires reading all of the model's weights, plus the growing KV cache, from memory. Bigger models, longer contexts, and higher-precision weights (more bytes per parameter) all push TPS down. This is why measured tokens per second, not a chip's theoretical peak, is the honest benchmark when sizing a rig.
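The bandwidth argument above suggests a rough ceiling: if every token requires one full pass over the weights and the KV cache, decode TPS cannot exceed memory bandwidth divided by the bytes read per token. The sketch below is a back-of-the-envelope estimate under loud assumptions: a dense model, a single user (batch size 1), standard attention with a key and a value stored per layer, and no overheads. The function names and the example numbers (a 7-billion-parameter model, 1000 GB/s of bandwidth) are illustrative, not measurements of any real device.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def bandwidth_bound_tps(bandwidth_bytes_per_s: float, weight_bytes: float,
                        kv_bytes: float = 0.0) -> float:
    """Upper bound on decode TPS if each token reads all weights plus the KV cache."""
    bytes_per_token = weight_bytes + kv_bytes
    if bytes_per_token <= 0:
        raise ValueError("bytes read per token must be positive")
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical 7e9-parameter model on hardware with 1000 GB/s memory bandwidth.
params = 7e9
bandwidth = 1000e9
fp16_ceiling = bandwidth_bound_tps(bandwidth, params * 2)    # 2 bytes per weight
int4_ceiling = bandwidth_bound_tps(bandwidth, params * 0.5)  # 0.5 bytes per weight

# Same model at 16-bit with a 4096-token context (assumed 32 layers, 32 KV heads, dim 128).
kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_tokens=4096)
fp16_long_context_ceiling = bandwidth_bound_tps(bandwidth, params * 2, kv)
```

Under these assumptions the 16-bit ceiling is about 71 TPS, dropping to roughly 62 TPS once the 4096-token cache is counted, while 4-bit weights raise the ceiling to about 286 TPS. Real measurements typically land below such ceilings, which is why a measured figure is still the one to trust.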
Tokens per second is the number to watch when choosing hardware for a Local LLM, and it is directly shaped by the KV Cache that grows with every token.
