In-Flight Batching

In-flight batching is a request-scheduling technique used by self-hosted LLM inference servers to keep a GPU fully occupied while generating text. In a naive server, a batch of requests starts together and the GPU waits for the slowest one to finish before accepting any new work. In-flight batching — the term NVIDIA's TensorRT-LLM uses; vLLM and others call the closely related idea continuous batching — instead manages the batch at the granularity of a single generation step. After each token is produced, any completed sequence is evicted from the batch and a freshly arrived request is slotted into the vacated space, so the batch is perpetually draining and refilling rather than marching in lockstep.
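The per-step drain-and-refill loop can be illustrated with a toy simulation. This is a minimal sketch, not how any particular engine implements its scheduler: the function name `run_in_flight` is hypothetical, each step is assumed to produce exactly one token per running sequence, prefill cost is ignored, every request is assumed to need at least one token, and waiting requests are admitted in arrival order.

```python
from collections import deque


def run_in_flight(lengths, max_batch):
    """Simulate per-step eviction and refill (toy model, assumptions above).

    lengths: tokens each request must generate, in arrival order (each >= 1).
    Returns (total_steps, finish_step_for_each_request).
    """
    waiting = deque(enumerate(lengths))
    running = {}                    # request id -> tokens still to generate
    finish = [0] * len(lengths)
    step = 0
    while waiting or running:
        # Refill: slot waiting requests into any vacated space.
        while waiting and len(running) < max_batch:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        # One generation step: every running sequence emits one token.
        step += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                # Evict the completed sequence right away.
                del running[rid]
                finish[rid] = step
    return step, finish


total, finish = run_in_flight([2, 5, 1, 3], max_batch=2)
print(total, finish)  # 6 [2, 5, 3, 6]
```

With two slots, the third request is admitted at step 3, as soon as the first one finishes, instead of waiting for the five-token request to complete.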

Why it matters for self-hosting

Generated sequences differ wildly in length: one user asks for a yes/no answer while another requests a thousand-word explanation. Without in-flight batching, a batch of ten requests where nine finish quickly stalls until the tenth long answer completes, wasting most of the GPU on padding. Per-step eviction and refill removes that head-of-line blocking, so throughput climbs sharply and the average time a request spends queued falls with it. For a sovereign operator running inference on owned hardware, this is the difference between a GPU that idles at a fraction of capacity and one that stays saturated — which directly translates into how many users, agents, or background jobs one machine can serve. The technique is standard in mainstream serving engines, so a home lab gets it by choosing the right server software rather than writing scheduler code.
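The cost of head-of-line blocking can be put in rough numbers. The sketch below counts generation steps for lockstep batching and for in-flight batching on the same workload. It rests on simplifying assumptions: one token per sequence per step, no prefill cost, first-come-first-served admission, and made-up token counts. The helper names (`static_steps`, `in_flight_steps`, `utilization`) are illustrative only.

```python
import heapq


def static_steps(lengths, batch_size):
    # Lockstep batching: each batch runs until its longest member finishes.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def in_flight_steps(lengths, batch_size):
    # In-flight batching: each slot takes the next waiting request
    # as soon as its current sequence finishes.
    slots = [0] * batch_size        # step at which each slot becomes free
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + n)
    return max(slots)


def utilization(lengths, batch_size, steps):
    # Fraction of slot-steps that produced a useful token.
    return sum(lengths) / (steps * batch_size)


# Two waves of ten requests: nine short answers and one long one each.
lengths = ([20] * 9 + [1000]) * 2
s = static_steps(lengths, 10)       # 2000
f = in_flight_steps(lengths, 10)    # 1040
print(s, round(utilization(lengths, 10, s), 3))
print(f, round(utilization(lengths, 10, f), 3))
```

In this toy workload the second wave's short requests finish at step 40 rather than waiting until step 1000 to even begin, and the whole job completes in roughly half the steps.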

The trade-offs

Because sequences join and leave the batch at different times and grow to different lengths, in-flight batching is usually paired with paged key-value caching, which allocates and frees memory for each sequence in small blocks independently. Without it, fragmentation of the KV cache in VRAM would undo much of the gain. The scheduler must also interleave the compute-heavy prefill phase of new arrivals with the memory-bound decode phase of running sequences; done carelessly, admitting a request with a huge prompt causes a visible latency stutter for every user already mid-response. Techniques like chunked prefill split large prompts into pieces to smooth this out. There is also a ceiling on concurrency: every admitted sequence holds KV-cache memory in proportion to its context length, so batch depth is ultimately bounded by VRAM, not ambition.
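The VRAM ceiling can be estimated with back-of-envelope arithmetic. The sketch below uses the standard per-token KV-cache size for a transformer (keys plus values, for every layer and KV head). The model shape, the 8 GiB cache budget, the 16-token block size, and the function names are all illustrative assumptions, not figures for any specific model or engine.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_sequences(cache_bytes, context_tokens, per_token_bytes, block_tokens=16):
    # A paged cache hands out whole blocks, so round each sequence up
    # to a multiple of the block size.
    blocks_per_seq = -(-context_tokens // block_tokens)   # ceiling division
    bytes_per_seq = blocks_per_seq * block_tokens * per_token_bytes
    return cache_bytes // bytes_per_seq


# Hypothetical shape: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token)                                       # 131072 bytes (128 KiB)

# Assume 8 GiB of VRAM is left for cache pages after loading the weights.
print(max_sequences(8 * 1024**3, 4096, per_token))     # 16
```

Under these assumptions, sixteen sequences at a full 4096-token context exhaust the cache. Shorter contexts allow a deeper batch, which is why real traffic usually fits more concurrent requests than the worst case suggests.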

Where it fits in your stack

If you run a local LLM for a single user typing one question at a time, in-flight batching buys little — there is no queue to optimize. It becomes decisive the moment concurrency appears: a family or small shop sharing one inference box, an agent pipeline firing parallel requests, or a miner-monitoring assistant summarizing logs in the background while you chat in the foreground. It is one of several serving optimizations a self-hoster combines; see throughput-optimized serving for the broader latency-versus-throughput picture, and request scheduling for how the server decides which work to admit next.

Tuning it in practice

Serving engines expose the batching machinery through a handful of knobs worth understanding. A maximum-concurrent-sequences setting caps how many requests share the GPU at once: raise it and throughput climbs until KV-cache memory runs out or per-user streaming speed degrades below comfort; lower it and latency tightens while the queue grows. A memory-utilization target controls how much VRAM the server pre-reserves for cache pages, and watching the reported cache usage under real traffic tells you whether you are context-bound or compute-bound. The practical method is empirical: replay a realistic mix of short and long requests, watch time-to-first-token and tokens-per-second per user, and move one knob at a time. Most home-lab disappointments with "slow" local serving trace back to defaults tuned for datacenter GPUs — a half hour of deliberate tuning on your actual hardware routinely recovers a large fraction of the machine's real capacity.
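For the measurement side of that loop, the two per-user numbers can be derived from token arrival timestamps recorded by whatever client replays the traffic. This is a minimal sketch; `stream_metrics` is a hypothetical helper, and it assumes you have already captured the time the request was sent and the time each streamed token arrived. In vLLM, for example, the two knobs described above are exposed as `--max-num-seqs` and `--gpu-memory-utilization`; other engines name them differently.

```python
def stream_metrics(request_sent, token_times):
    """Return (time_to_first_token, decode_tokens_per_second) for one request.

    request_sent: timestamp in seconds when the request was issued.
    token_times:  timestamps in seconds at which each streamed token arrived.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    # Decode speed is measured after the first token, so queueing and
    # prefill time do not distort the per-user streaming rate.
    decode_time = token_times[-1] - token_times[0]
    n_decode = len(token_times) - 1
    tps = n_decode / decode_time if decode_time > 0 else float("nan")
    return ttft, tps


ttft, tps = stream_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(f"TTFT {ttft:.2f}s, {tps:.1f} tokens/s")  # TTFT 0.50s, 10.0 tokens/s
```

Collecting these two values per request before and after each knob change makes it easier to tell whether a setting helped throughput at the expense of individual users.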
