Definition
Time to first token (TTFT) measures how long a large language model (LLM) takes to produce its very first output token after a request arrives. It captures the queuing or scheduling delay plus the entire prefill phase — the parallel processing of the prompt that builds the key-value (KV) cache. TTFT is the metric users feel as "how long until it starts answering," and it is the single most important latency number for interactive, streaming applications like chat.
What drives TTFT
Because TTFT is essentially the latency of prefill, anything that lengthens prefill lengthens TTFT: a longer prompt, a large system prompt, retrieved context, or a context window filled closer to its limit. Prefill is typically compute-bound. Its cost grows roughly linearly with prompt length for short and moderate prompts, while the attention component grows with the square of prompt length and comes to dominate at long contexts. Either way, TTFT can be far larger than the time for any single later token. On a shared self-hosted box, scheduling delay also counts: a request waiting behind others adds to its measured TTFT.
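The scaling described above can be sketched as a back-of-the-envelope estimate. This is a rough model under stated assumptions, not a benchmark: it assumes a dense transformer, about 2 FLOPs per parameter per prompt token for the matrix multiplies, about 4 * d_model * P^2 FLOPs per layer for attention (ignoring causal-mask savings), and a single sustained-FLOPs figure for the hardware. The function name and all parameters are hypothetical.

```python
def estimate_ttft_seconds(prompt_tokens, n_params, n_layers, d_model,
                          sustained_flops, queue_delay_s=0.0):
    """Rough TTFT estimate: queue delay plus compute-bound prefill time.

    All inputs are illustrative assumptions; real systems also pay for
    memory traffic, kernel launch overhead, and network latency.
    """
    # Linear term: about 2 FLOPs per parameter per prompt token.
    linear_flops = 2.0 * n_params * prompt_tokens
    # Quadratic term: attention scores (QK^T) and weighted values (AV),
    # about 4 * d_model * P^2 FLOPs per layer, ignoring causal-mask savings.
    attention_flops = 4.0 * n_layers * d_model * prompt_tokens ** 2
    prefill_s = (linear_flops + attention_flops) / sustained_flops
    return queue_delay_s + prefill_s


# Hypothetical 7B-parameter model on hardware sustaining 50 TFLOP/s.
MODEL = dict(n_params=7e9, n_layers=32, d_model=4096, sustained_flops=50e12)

for p in (1_000, 8_000, 32_000, 64_000):
    print(p, round(estimate_ttft_seconds(p, **MODEL), 3))
```

With these assumed numbers, doubling a short prompt roughly doubles the estimate, while doubling a very long prompt roughly triples it, because the quadratic attention term has overtaken the linear one.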
TTFT versus per-token latency
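TTFT should not be confused with the time between subsequent tokens (often called TPOT, time per output token, or inter-token latency), which measures the decode phase. A system can have fast TTFT but slow generation, or vice versa, so honest benchmarks report both. Prefix caching targets TTFT directly by reusing prefill work for a shared prompt prefix. Chunked prefill splits a long prefill into smaller pieces that can be scheduled alongside decode steps, which keeps one long prompt from stalling other requests; its effect on an individual request's TTFT depends on the scheduler.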
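To report both numbers from one request, it is enough to timestamp each token as it arrives. The following is a minimal sketch, assuming the client exposes the response as a lazy iterable of tokens (so the request is issued when iteration starts); `measure_stream` and its `clock` parameter are hypothetical names, not part of any particular library.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float], int]:
    """Return (ttft_seconds, mean_inter_token_seconds, token_count).

    TTFT is measured from just before iteration begins to the first token.
    The inter-token figure averages the gaps between later tokens and is
    None when fewer than two tokens arrive.
    """
    start = clock()
    first_at = None
    last_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now  # arrival of the first token ends the TTFT window
        last_at = now
        count += 1
    if first_at is None:
        return None, None, 0  # the stream produced no tokens
    ttft = first_at - start
    tpot = (last_at - first_at) / (count - 1) if count > 1 else None
    return ttft, tpot, count
```

Passing the clock in as a parameter keeps the function testable with a fake clock. Measured this way on the client side, TTFT also includes network and queuing time, which matches what a user actually experiences.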
For sovereign Bitcoiners running models locally, TTFT tells you how responsive your rig feels before throughput even matters. See the prefill phase that TTFT measures and throughput vs latency for the broader picture.
Estimate latency in the inference cost calculator.
