Tokens per Second

Sovereign AI

Tokens per second (TPS) is the primary metric for how fast a large language model produces text: the number of tokens generated divided by the time it took to generate them. Since a token averages roughly three-quarters of an English word, TPS tells you, in practical terms, how quickly answers appear on screen — and whether a given model-and-hardware combination feels snappy or sluggish. When you are sizing a rig for local inference, measured tokens per second is the honest number; everything else on the spec sheet is a proxy.
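The definition above is a single division, sketched here for concreteness. The function names and the sample figures (512 tokens in 16 seconds) are illustrative assumptions, and the 0.75 words-per-token factor is only the rough English average mentioned above.

```python
def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Average generation throughput: tokens produced divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


def approx_words_per_second(tps: float, words_per_token: float = 0.75) -> float:
    """Rough English reading-rate equivalent; the ratio varies by tokenizer and text."""
    return tps * words_per_token


# Illustrative run: 512 tokens generated in 16 seconds.
tps = tokens_per_second(generated_tokens=512, elapsed_seconds=16.0)
print(f"{tps:.1f} tokens/s, about {approx_words_per_second(tps):.1f} words/s")
# prints "32.0 tokens/s, about 24.0 words/s"
```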

Throughput versus latency

TPS measures throughput, but two latency metrics matter just as much for interactive use. Time to first token (TTFT) is the pause before anything appears; it is dominated by prompt processing, so long prompts and large contexts make it grow. Time per output token (TPOT) is the average gap between tokens once streaming has started, the inverse of the streaming rate, and it governs how smoothly text flows. Two setups with identical average TPS can feel completely different if one makes you wait seconds before the first word. As a rule of thumb, sustained generation around 6 tokens per second matches comfortable human reading speed, so anything above that feels real-time for a single user; agents and batch jobs, which consume their own output, benefit from far more.
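To make the distinction concrete, here is a minimal sketch that derives TTFT, TPOT, and two flavours of TPS from per-token arrival timestamps. The `StreamTiming` and `summarize_stream` names are hypothetical, and the timestamps are synthetic: both streams deliver 101 tokens in 10.5 seconds, so their end-to-end TPS is the same by construction, yet one starts after 0.5 seconds and the other after 4.5.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    ttft: float            # seconds from request to first token
    tpot: float            # mean seconds per output token after the first
    decode_tps: float      # streaming rate once output has started (1 / tpot)
    end_to_end_tps: float  # all tokens divided by total wall-clock time


def summarize_stream(request_time: float, token_times: list[float]) -> StreamTiming:
    """Derive latency and throughput figures from per-token arrival timestamps."""
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    ttft = token_times[0] - request_time
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    total = token_times[-1] - request_time
    return StreamTiming(ttft, tpot, 1.0 / tpot, len(token_times) / total)


def synthetic_stream(first_token_at: float, last_token_at: float, n_tokens: int) -> list[float]:
    """Evenly spaced timestamps, used only to build illustrative data."""
    step = (last_token_at - first_token_at) / (n_tokens - 1)
    return [first_token_at + i * step for i in range(n_tokens)]


quick_start = summarize_stream(0.0, synthetic_stream(0.5, 10.5, 101))
slow_start = summarize_stream(0.0, synthetic_stream(4.5, 10.5, 101))
for name, s in (("quick start", quick_start), ("slow start", slow_start)):
    print(f"{name}: TTFT {s.ttft:.2f}s, TPOT {s.tpot * 1000:.0f}ms, "
          f"end-to-end {s.end_to_end_tps:.2f} tok/s")
```

The slow-start stream has the faster decode rate, but a reader still stares at a blank screen for several seconds first, which is why a single averaged TPS figure can mislead.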

What limits TPS on your own hardware

On self-owned hardware, token generation is usually memory-bandwidth bound, not compute bound: producing each new token requires streaming essentially all of the model's weights (for a dense model), plus the growing KV cache, through the processor. That is why a GPU's memory bandwidth predicts generation speed better than its FLOPS, and why models that fit entirely in VRAM massively outperform ones that spill to system RAM: the moment layers fall back to the CPU, TPS falls off a cliff. Bigger models, longer contexts, and higher precision all push TPS down for the same reason: more bytes to move per token.
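The bandwidth argument gives a back-of-envelope ceiling: memory bandwidth divided by the bytes read per generated token. The sketch below assumes a hypothetical dense 8-billion-parameter model, a device with 1 TB/s of memory bandwidth, and a made-up attention layout (32 layers, 8 KV heads, head dimension 128); none of these figures describe a specific product, and real throughput lands below the ceiling.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def tps_ceiling(bandwidth_bytes_per_s: float, bytes_per_token: float) -> float:
    """Upper bound on decode TPS if every byte is read once per generated token."""
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical dense 8B-parameter model on a device with 1 TB/s memory bandwidth.
weights = weight_bytes(8e9, 16)
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
print(f"weights only: {tps_ceiling(1e12, weights):.1f} tok/s ceiling")
print(f"with 8k-token KV cache: {tps_ceiling(1e12, weights + cache):.1f} tok/s ceiling")
```

Swapping in system-RAM bandwidth, which is typically far lower than VRAM bandwidth, shows why spilling out of VRAM is so costly.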

Levers you control

The biggest lever is quantization: a 4-bit quantized model moves roughly a quarter of the bytes of its 16-bit original, which in the bandwidth-bound regime translates almost directly into more tokens per second and leaves more headroom before spilling out of VRAM, usually at a modest quality cost. Beyond that: pick a model size that genuinely fits your card, keep prompts as short as the task allows, cap the context you allocate in your runner, and use inference engines with efficient attention and cache management. Speculative decoding, where a small draft model proposes tokens that a large model verifies, can add real speedups on supported stacks.
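A rough fit check shows how quantization changes what a card can hold. This is a sketch under stated assumptions: the 13-billion-parameter model, the 16 GiB card, and the 2 GiB reserve for KV cache and runtime overhead are all hypothetical, and real quantized formats store scaling metadata, so their effective bits per weight sit somewhat above the nominal figure.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB at a given nominal precision."""
    return n_params * bits_per_weight / 8 / 2**30


def fits_in_vram(n_params: float, bits_per_weight: float,
                 vram_gib: float, reserve_gib: float = 2.0) -> bool:
    """True if weights plus a reserve for KV cache and overhead fit on the card."""
    return weight_gib(n_params, bits_per_weight) + reserve_gib <= vram_gib


# Hypothetical 13B-parameter model on a hypothetical 16 GiB card.
for bits in (16, 8, 4):
    size = weight_gib(13e9, bits)
    verdict = "fits" if fits_in_vram(13e9, bits, vram_gib=16) else "spills"
    print(f"{bits:>2}-bit: {size:5.1f} GiB of weights, {verdict}")
```

The reserve is the knob to tune: longer allocated contexts need a larger KV cache, which is one reason capping context in the runner frees room for a bigger or less aggressively quantized model.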

Benchmarking honestly

Measure on your own machine with your own workloads: the same model can differ severalfold in TPS between runners, quantization levels, and context lengths. Report the runner, prompt length, generated length, context length, and quantization alongside any number, or it is not comparable. Tokens per second is the figure of merit when choosing hardware for a local LLM. Treat vendor numbers the way a miner treats a used ASIC listing: verify them against your own bench before you rely on them.
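One way to keep numbers comparable is to record the run conditions next to the measurement. The harness below is a minimal sketch: `BenchResult`, `bench`, and `fake_generate` are hypothetical names, and `fake_generate` merely sleeps in place of a real model call, so the figures it prints say nothing about any actual runner.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchResult:
    runner: str
    model: str
    quantization: str
    prompt_tokens: int
    generated_tokens: int  # summed over all repeats
    seconds: float         # summed over all repeats

    @property
    def tps(self) -> float:
        return self.generated_tokens / self.seconds


def bench(generate: Callable[[], int], *, runner: str, model: str,
          quantization: str, prompt_tokens: int, repeats: int = 3) -> BenchResult:
    """Time repeated runs of `generate`, which must return tokens produced."""
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens += generate()
        total_seconds += time.perf_counter() - start
    return BenchResult(runner, model, quantization, prompt_tokens,
                       total_tokens, total_seconds)


def fake_generate() -> int:
    """Stand-in for a real model call; sleeps briefly and reports 64 tokens."""
    time.sleep(0.02)
    return 64


result = bench(fake_generate, runner="example-runner", model="example-8b",
               quantization="4-bit", prompt_tokens=256)
print(f"{result.runner} {result.model} {result.quantization} "
      f"prompt={result.prompt_tokens} generated={result.generated_tokens} "
      f"{result.tps:.1f} tok/s")
```

Because the timer wraps the whole call, this measures end-to-end throughput including prompt processing; separating TTFT from decode speed requires per-token timestamps from a streaming interface.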

Estimate your throughput in the inference cost calculator.
