Throughput vs Latency (LLM Inference)


Throughput versus latency is the central trade-off when serving large language models (LLMs). Latency measures how fast one request is served — time to first token (TTFT) for the initial response, and inter-token latency for each token after. Throughput measures total work across the whole system, usually as tokens per second summed over every concurrent request. The two pull in opposite directions: choices that maximize aggregate throughput often make any individual request feel slower, and vice versa. Every serving stack, from a hobbyist's single GPU to a datacenter cluster, is an answer to the question of where on this curve to sit.
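The three metrics named above can be computed from per-token timestamps. The following is a minimal sketch, not a standard API: the `RequestTrace` type, the function names, and the sample timestamps are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps (in seconds) recorded for one request. Hypothetical type."""
    sent_at: float            # when the request was submitted
    token_times: list[float]  # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between submission and the first token."""
    return trace.token_times[0] - trace.sent_at


def mean_inter_token_latency(trace: RequestTrace) -> float:
    """Average gap between consecutive tokens after the first one."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps)


def aggregate_throughput(traces: list[RequestTrace]) -> float:
    """Tokens per second summed over every request in the measurement window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


# Two concurrent requests with made-up timestamps.
a = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
b = RequestTrace(sent_at=0.0, token_times=[1.0, 1.25, 1.5, 1.75, 2.0])

print("TTFT (a):", ttft(a))
print("inter-token latency (a):", mean_inter_token_latency(a))
print("aggregate tokens/s:", aggregate_throughput([a, b]))
```

Note that the two latency figures are per request, while the throughput figure only makes sense over the whole set of requests.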

Why they conflict

The tension comes from how inference uses hardware. The decode phase is memory-bandwidth-bound, so a single request rarely saturates a large GPU on its own — most of the chip's arithmetic capacity sits idle while the memory bus streams weights and KV cache. Packing more requests into a batch amortizes each weight read across many tokens, which is why batching is the single biggest throughput lever in LLM serving. But each request in the batch now shares compute and memory bandwidth, so its own token stream slows, and requests that arrive mid-batch may queue. Continuous batching narrows the gap by swapping finished requests out and admitting new ones between decode steps, lifting throughput with a far smaller latency penalty than naive static batching — it is the reason modern serving engines dramatically outperform simple sequential loops on multi-user workloads.
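A toy roofline-style model makes the batching effect concrete. This is a rough sketch under loud assumptions: every hardware and model number below is hypothetical, and the model ignores prefill, scheduling overhead, and kernel efficiency. It only captures the idea that weights are read once per decode step while each request adds its own KV-cache traffic and arithmetic.

```python
def decode_step_seconds(batch_size, weight_bytes, kv_bytes_per_request,
                        mem_bandwidth, flops_per_token, peak_flops):
    """Estimate one decode step as the slower of memory traffic and arithmetic."""
    # Weights stream from memory once per step, whatever the batch size;
    # every request in the batch adds a read of its own KV cache.
    memory_time = (weight_bytes + batch_size * kv_bytes_per_request) / mem_bandwidth
    # Arithmetic grows linearly with the number of tokens produced per step.
    compute_time = batch_size * flops_per_token / peak_flops
    return max(memory_time, compute_time)


# Hypothetical numbers chosen only to illustrate the shape of the curve.
ASSUMED = dict(
    weight_bytes=8e9,          # about 8 GB of weights
    kv_bytes_per_request=0.5e9,  # KV cache read per request per step
    mem_bandwidth=1e12,        # 1 TB/s memory bandwidth
    flops_per_token=16e9,      # roughly 2 FLOPs per parameter at 8B parameters
    peak_flops=100e12,         # 100 TFLOP/s peak arithmetic
)


def sweep(batch_sizes):
    """Return (batch, per-request tokens/s, aggregate tokens/s) for each batch size."""
    rows = []
    for batch in batch_sizes:
        step = decode_step_seconds(batch, **ASSUMED)
        rows.append((batch, 1 / step, batch / step))
    return rows


for batch, per_request, aggregate in sweep([1, 4, 16, 64]):
    print(f"batch={batch:3d}  per-request={per_request:7.1f} tok/s  "
          f"aggregate={aggregate:8.1f} tok/s")
```

Under these assumptions, aggregate tokens per second climbs with batch size while each request's own stream slows, which is the trade-off the paragraph describes.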

Choosing a target

The right operating point depends entirely on the workload. Interactive chat is latency-sensitive: a low TTFT and a token stream at or above human reading speed matter more than raw volume, and a snappy small model often beats a sluggish large one for perceived quality. Offline and bulk jobs — summarizing an archive, classifying a corpus, generating embeddings for a knowledge base — are throughput-sensitive: total tokens per hour is what counts, per-request delay is irrelevant, and you should batch as aggressively as memory allows. Knowing which side you are on tells you how to size hardware, pick quantization levels, and configure batch limits. Mixed workloads are the hard case; the usual answer is separating interactive and batch traffic rather than letting a bulk job sit in front of a human.
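One simple way to keep bulk work out of a human's way is to separate the two by schedule. The sketch below assumes a hypothetical admission check with a made-up overnight window of 01:00 to 06:00; the function names and the two request kinds are illustrative, not part of any serving engine.

```python
from datetime import time


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def admit(kind: str, now: time,
          batch_start: time = time(1, 0), batch_end: time = time(6, 0)) -> bool:
    """Interactive requests are always admitted; batch jobs only inside the batch window."""
    if kind == "interactive":
        return True
    if kind == "batch":
        return in_window(now, batch_start, batch_end)
    raise ValueError(f"unknown request kind: {kind!r}")


print(admit("interactive", time(14, 0)))  # chat during the day
print(admit("batch", time(14, 0)))        # bulk job during the day
print(admit("batch", time(2, 30)))        # bulk job overnight
```

Separating by instance instead would replace the time check with a lookup that sends each kind of request to its own server.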

The self-hosting angle

For sovereign Bitcoiners self-hosting inference, this trade-off decides whether you tune your rig as a responsive personal assistant or as a high-volume batch pipeline. The good news is that a single-user setup lives on the easy end of the curve. One person's chat session needs no batching heroics: a modest GPU with adequate VRAM running llama.cpp or Ollama delivers excellent latency precisely because it is not sharing the memory bus with anyone. The calculus changes when a household or small operation shares one box, or when you schedule overnight batch work; indexing documents for a RAG pipeline is a throughput job that happily runs while you sleep. Measure all three numbers on your own hardware before buying anything: TTFT, tokens per second per request, and aggregate tokens per second under realistic concurrency tell you more than any spec sheet.
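A small harness can collect those numbers under concurrency. The sketch below is deliberately generic: `fake_stream` is a stand-in with artificial delays, not the llama.cpp or Ollama API, and you would swap in whatever streaming client your server actually exposes. The function names are assumptions made for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def fake_stream(prompt: str) -> Iterator[str]:
    """Placeholder for a real streaming client; yields 20 tokens with fake delays."""
    time.sleep(0.05)            # pretend prefill
    for i in range(20):
        time.sleep(0.005)       # pretend decode step
        yield f"tok{i}"


def measure_one(stream_fn: Callable[[str], Iterator[str]], prompt: str) -> dict:
    """Time a single streamed request: TTFT and tokens per second for that request."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in stream_fn(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise RuntimeError("stream produced no tokens")
    return {
        "ttft_s": first_token_at - start,
        "tokens": tokens,
        "tokens_per_s": tokens / (end - start),
    }


def measure_concurrent(stream_fn: Callable[[str], Iterator[str]],
                       prompts: list[str]) -> dict:
    """Run all prompts at once and report per-request and aggregate numbers."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        results = list(pool.map(lambda p: measure_one(stream_fn, p), prompts))
    wall = time.perf_counter() - wall_start
    total_tokens = sum(r["tokens"] for r in results)
    return {"requests": results, "aggregate_tokens_per_s": total_tokens / wall}


report = measure_concurrent(fake_stream, ["a", "b", "c"])
for r in report["requests"]:
    print(f"TTFT={r['ttft_s']:.3f}s  per-request={r['tokens_per_s']:.1f} tok/s")
print(f"aggregate={report['aggregate_tokens_per_s']:.1f} tok/s")
```

Running it with one prompt and then with the number of simultaneous users you actually expect gives a first picture of how per-request speed degrades as aggregate throughput rises.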

Rules of thumb

For interactive chat, aim for a first token within about a second and a stream at or above comfortable reading speed — roughly 10 tokens per second and up feels fluid.
For batch work, ignore per-request numbers entirely; measure tokens per hour and cost per million tokens on your own hardware.
Never let bulk jobs share a queue with interactive users — separate them by schedule or by instance.
Before buying hardware for more throughput, try a smaller or more aggressively quantized model first; it is the cheapest point on the entire trade-off curve.
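The first two rules reduce to simple arithmetic. The sketch below uses the thresholds quoted above (about one second to first token, roughly 10 tokens per second) and counts electricity only; the power draw, electricity price, and throughput in the example are made-up inputs, and hardware amortization is ignored.

```python
def feels_fluid(ttft_s: float, tokens_per_s: float,
                max_ttft_s: float = 1.0, min_tokens_per_s: float = 10.0) -> bool:
    """Check an interactive setup against the rule-of-thumb thresholds."""
    return ttft_s <= max_ttft_s and tokens_per_s >= min_tokens_per_s


def tokens_per_hour(tokens_per_s: float) -> float:
    """Convert sustained aggregate throughput into tokens per hour."""
    return tokens_per_s * 3600


def cost_per_million_tokens(tokens_per_s: float, power_watts: float,
                            price_per_kwh: float) -> float:
    """Electricity cost per million tokens (hardware cost deliberately left out)."""
    cost_per_hour = (power_watts / 1000) * price_per_kwh
    return cost_per_hour / tokens_per_hour(tokens_per_s) * 1_000_000


# Hypothetical batch rig: 500 tok/s aggregate, 400 W at the wall, 0.25 per kWh.
print("fluid chat?", feels_fluid(ttft_s=0.6, tokens_per_s=18))
print("tokens per hour:", tokens_per_hour(500))
print("cost per million tokens:", round(cost_per_million_tokens(500, 400, 0.25), 4))
```

Plugging in your own measured throughput and wall power makes it easy to compare a smaller or more heavily quantized model against a hardware upgrade.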

See time to first token (TTFT) and batch size (inference) for the component metrics, and the decode phase for why memory bandwidth sits underneath all of it.

