Time To First Token (TTFT)

Time to first token (TTFT) measures how long a large language model (LLM) takes to produce its very first output token after a request arrives. It captures the queuing or scheduling delay plus the entire prefill phase — the parallel processing of the prompt that builds the key-value (KV) cache. TTFT is the metric users feel as "how long until it starts answering," and it is the single most important latency number for interactive, streaming applications like chat: a model can generate quickly once started, but if the first token takes eight seconds, the experience is already broken.

What drives TTFT

Because TTFT is essentially the latency of prefill, anything that lengthens prefill lengthens TTFT: a longer prompt, a large system prompt, retrieved documents from a RAG pipeline, or simply a bigger context window filled with conversation history. Prefill is compute-bound and its attention cost grows roughly with the square of prompt length, so TTFT can be far larger than the time for any single later token. Hardware matters in a specific way: prefill loves raw compute throughput, while the decode phase that follows is limited mostly by memory bandwidth — so two GPUs with similar VRAM can post very different TTFT numbers. On a shared self-hosted box, scheduling delay also counts: a request waiting behind others adds every millisecond of that wait to its measured TTFT.
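As a rough illustration of why prompt length dominates, the sketch below estimates TTFT as queue wait plus compute-bound prefill time. The FLOP formulas are common back-of-envelope approximations (about 2 FLOPs per parameter per prompt token for the weights, about 4 x layers x hidden size x n^2 for attention). The function name, the model shape, the GPU throughput, and the utilization factor are all assumptions chosen for illustration, not measurements of any real rig.

```python
def estimate_ttft_seconds(prompt_tokens, params_billion, gpu_tflops,
                          utilization=0.5, queue_wait_s=0.0,
                          n_layers=32, d_model=4096):
    """Back-of-envelope TTFT: queue wait plus compute-bound prefill time.

    All defaults are illustrative assumptions (a 7B-class model shape and
    50% of peak GPU throughput), not properties of any specific system.
    """
    # Weight (dense) work: roughly 2 FLOPs per parameter per prompt token.
    dense_flops = 2.0 * params_billion * 1e9 * prompt_tokens
    # Attention work: QK^T and attention-times-V, each ~2 * n^2 * d per layer,
    # so this term grows with the square of the prompt length.
    attn_flops = 4.0 * n_layers * d_model * prompt_tokens ** 2
    effective_flops_per_s = gpu_tflops * 1e12 * utilization
    prefill_s = (dense_flops + attn_flops) / effective_flops_per_s
    # Time spent waiting behind other requests is added in full.
    return queue_wait_s + prefill_s


short_s = estimate_ttft_seconds(prompt_tokens=50, params_billion=7, gpu_tflops=80)
long_s = estimate_ttft_seconds(prompt_tokens=8000, params_billion=7, gpu_tflops=80)
print(f"50-token prompt:   {short_s:.4f} s")
print(f"8000-token prompt: {long_s:.2f} s")
```

The absolute numbers depend entirely on the assumed throughput. The useful part is the ratio: the long prompt costs more than 160 times the short one because the quadratic attention term is added on top of the linear weight term.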

TTFT versus per-token latency

TTFT should not be confused with the time between subsequent tokens (often called TPOT, time per output token, or inter-token latency), which measures the decode phase. A system can have fast TTFT but slow generation, or the reverse, so honest benchmarks report both, plus end-to-end latency for a full response. Optimizations target them differently: prefix caching shrinks TTFT directly by reusing prefill work that has already been done, while chunked prefill splits a long prompt into pieces so that it does not stall other requests on a shared server, which protects their latency rather than shortening the long prompt's own prefill. Quantization and batching mostly help decode throughput. Prefix caching deserves special attention for self-hosters: if your assistant reuses the same long system prompt on every request, caching that prefix can turn a multi-second TTFT into a near-instant response.
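The three numbers can be derived from the same set of timestamps. The sketch below is a minimal illustration, assuming you have recorded when the request was sent and when each output token (or streamed chunk) arrived; the names `LatencyReport` and `latency_report` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LatencyReport:
    ttft_s: float  # request sent -> first token
    tpot_s: float  # mean gap between consecutive tokens after the first
    e2e_s: float   # request sent -> last token


def latency_report(request_sent_s, token_times_s):
    """Compute TTFT, TPOT and end-to-end latency from arrival timestamps."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    first, last = token_times_s[0], token_times_s[-1]
    n = len(token_times_s)
    # TPOT averages over the n - 1 gaps that follow the first token, so the
    # prefill time captured by TTFT never leaks into the decode figure.
    tpot = (last - first) / (n - 1) if n > 1 else 0.0
    return LatencyReport(ttft_s=first - request_sent_s,
                         tpot_s=tpot,
                         e2e_s=last - request_sent_s)


# Request sent at t = 10.0 s; tokens arrive at 12.0, 12.5 and 13.0 s.
report = latency_report(10.0, [12.0, 12.5, 13.0])
print(report)
```

In this made-up trace the first token takes 2.0 s, later tokens arrive 0.5 s apart, and the whole response takes 3.0 s, which is why a single headline number cannot describe both phases.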

Tuning TTFT on a sovereign rig

For sovereign Bitcoiners running models locally through Ollama or llama.cpp, TTFT tells you how responsive your rig feels before throughput even matters. Practical levers, in rough order of impact: keep the model fully in GPU memory (any layers offloaded to CPU inflate prefill dramatically); trim what you actually send, since a lean system prompt and tight RAG retrieval beat a kitchen-sink context; enable prompt/prefix caching in your serving stack; and measure with your own prompts, since a benchmark run on a 50-token prompt says nothing about your 8,000-token workflow. The pattern will feel familiar to anyone who tunes miners: identify the bound resource, measure honestly, change one variable at a time. See also the prefill phase that TTFT measures, throughput vs latency for the broader trade-off, and inference for where both fit in the model lifecycle.
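One way to see where the time goes on an Ollama rig is to read the timing fields in the final response object. The sketch below assumes the nanosecond fields Ollama documents for its generate and chat responses (`load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`); check them against your installed version. The helper name and the sample values are hypothetical, and treating load time plus prompt evaluation as "server-side TTFT" is an approximation that ignores queueing and network time.

```python
NS_PER_S = 1e9


def summarize_ollama_stats(final_chunk):
    """Split an Ollama final-response dict into prefill and decode figures.

    Missing fields default to 0, since some may be absent (for example when
    the prompt was served from cache).
    """
    load_s = final_chunk.get("load_duration", 0) / NS_PER_S
    prefill_s = final_chunk.get("prompt_eval_duration", 0) / NS_PER_S
    prompt_tokens = final_chunk.get("prompt_eval_count", 0)
    decode_s = final_chunk.get("eval_duration", 0) / NS_PER_S
    output_tokens = final_chunk.get("eval_count", 0)
    return {
        "load_s": load_s,
        "prefill_s": prefill_s,
        # Approximation: model load plus prompt evaluation, server side only.
        "server_ttft_s": load_s + prefill_s,
        "prefill_tok_per_s": prompt_tokens / prefill_s if prefill_s else 0.0,
        "decode_tok_per_s": output_tokens / decode_s if decode_s else 0.0,
    }


# Illustrative values only, not a real measurement.
sample = {
    "load_duration": 0,
    "prompt_eval_count": 8000,
    "prompt_eval_duration": 2_000_000_000,
    "eval_count": 100,
    "eval_duration": 5_000_000_000,
}
stats = summarize_ollama_stats(sample)
print(stats)
```

Comparing `prefill_tok_per_s` before and after a change (fewer offloaded layers, a shorter system prompt, caching enabled) keeps the one-variable-at-a-time discipline concrete.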

Measuring it honestly

TTFT is easy to measure badly. Streaming must be enabled end to end, or you will measure full-response latency and call it TTFT; the first chunk's arrival time is the datum. Cold starts deserve separate bookkeeping: the first request after model load includes loading weights from disk and allocating caches, costs that later requests never pay, so report warm and cold numbers separately. Client-side overhead (TLS setup, proxies, a browser rendering loop) rides along in what users feel, so measure at the interface your users actually touch. And percentiles beat averages: a rig whose median TTFT is 400 ms but whose p95 is six seconds has a scheduling or memory-pressure problem that an average would politely hide. The habit transfers directly from mining telemetry: trust distributions, not single numbers.
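A minimal measurement harness might look like the sketch below. It assumes you can wrap your client in a hypothetical zero-argument callable, `open_stream`, that sends the request and returns an iterator of streamed chunks; the clock starts before the request is sent and stops at the first chunk. The function names are illustrative, and the fake stream stands in for a real client.

```python
import statistics
import time


def time_to_first_chunk(open_stream, clock=time.perf_counter):
    """Seconds from just before the request is sent to the first chunk."""
    start = clock()
    stream = open_stream()  # hypothetical: sends the request, returns an iterator
    for _ in stream:
        # Stop at the first chunk; the rest of the response is not TTFT.
        return clock() - start
    raise RuntimeError("stream ended without producing a chunk")


def summarize_ttft(samples_s):
    """Report median, p95 and mean for a list of TTFT samples (needs >= 2)."""
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_s),
        "p95": cuts[94],  # 95th of the 99 percentile cut points
        "mean": statistics.fmean(samples_s),
    }


def fake_stream():
    """Stand-in for a real streaming client: a short delay, then two chunks."""
    time.sleep(0.02)
    yield "Hello"
    yield " world"


# Record the first (cold) request separately from the warm ones.
cold_s = time_to_first_chunk(fake_stream)
warm_samples = [time_to_first_chunk(fake_stream) for _ in range(5)]
print("cold:", round(cold_s, 3))
print("warm:", summarize_ttft(warm_samples))
```

Keeping the cold sample out of the warm list prevents a one-off model load from distorting the percentiles, and reporting p95 next to the median exposes the tail that a mean alone would blur.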

Estimate latency in the inference cost calculator.
