Definition
In-flight batching is a request-scheduling technique used by self-hosted LLM inference servers to keep a GPU fully occupied while generating text. In a naive server with static batching, a batch of requests starts together and the GPU waits for the slowest one to finish before accepting any new work. In-flight batching (the term NVIDIA's TensorRT-LLM uses; vLLM and others call the closely related idea continuous batching) instead manages the batch at the granularity of a single generation step. After each step, in which every active sequence produces one token, any sequence that has completed is evicted from the batch and a waiting request is slotted into the vacated space.
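The step-level loop described above can be sketched as a toy simulation. This is a minimal illustration, not how any particular server implements it: the function name run_inflight, the lengths list (tokens each request will generate, at least 1) and the max_batch slot count are all assumptions made for the example, and a real scheduler would not know output lengths in advance.

```python
from collections import deque

def run_inflight(lengths, max_batch):
    """Toy step-level scheduler (hypothetical, for illustration only).

    lengths[i] is the number of tokens request i will generate.
    Returns {request_id: step at which that request finished}.
    """
    waiting = deque(enumerate(lengths))  # requests queued in arrival order
    active = {}                          # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Refill any free slots before the next generation step.
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        step += 1
        # One generation step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] <= 0:
                # Evict the completed sequence so its slot can be reused.
                del active[rid]
                finished_at[rid] = step
    return finished_at

if __name__ == "__main__":
    # Four requests, two batch slots: request 2 starts as soon as request 0 ends.
    print(run_inflight([2, 5, 1, 3], max_batch=2))
```

With two slots, the short third request is admitted as soon as the first one finishes, rather than waiting for the five-token request that shares the batch.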
Why it matters for self-hosting
Generated sequences differ wildly in length. Without in-flight batching, a batch of ten requests where nine finish quickly stalls until the tenth, long answer completes, leaving most of the batch slots empty in the meantime. Evicting and refilling per step removes that head-of-line blocking, so throughput climbs and the average time a request waits in the queue falls. For a sovereign operator running inference on their own hardware, this can be the difference between a GPU that sits at, say, 30% utilization and one that stays saturated.
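To put rough numbers on the head-of-line blocking argument, the sketch below counts generation steps for the same workload under static batching and under per-step refilling. The workload (two groups of nine 20-token answers plus one 200-token answer) and the helper names static_steps and inflight_steps are invented for illustration. The model also ignores prefill cost and treats every step as equally expensive, so the figures show the shape of the effect rather than measured performance.

```python
def static_steps(lengths, max_batch):
    # Static batching: each batch runs until its longest member finishes.
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

def inflight_steps(lengths, max_batch):
    # In-flight batching: freed slots are refilled before every step.
    waiting = list(reversed(lengths))  # pop() then yields arrival order
    active = []                        # tokens remaining per active sequence
    steps = 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.pop())
        steps += 1
        # Each active sequence emits one token; finished ones are dropped.
        active = [n - 1 for n in active if n - 1 > 0]
    return steps

if __name__ == "__main__":
    # Hypothetical workload: nine short answers and one long one, twice over.
    lengths = ([20] * 9 + [200]) * 2
    slots = 10
    total_tokens = sum(lengths)
    for name, fn in (("static", static_steps), ("in-flight", inflight_steps)):
        steps = fn(lengths, slots)
        utilization = total_tokens / (steps * slots)
        print(f"{name}: {steps} steps, slot utilization {utilization:.0%}")
    # static takes 400 steps; in-flight takes 240 steps
```

The in-flight run is still far from full here only because this finite queue drains and the last long answer finishes nearly alone. With a steady stream of arrivals the freed slots would keep being refilled.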
The trade-offs
Because sequences join and leave the batch at different times and grow to different lengths, in-flight batching is usually paired with paged key-value caching so that memory for each sequence can be allocated and freed independently. It also interleaves the compute-heavy prefill phase of newly admitted requests with the memory-bound decode phase of requests already running. The scheduler must balance the two, because a long prefill can delay the next token for users already mid-response and cause latency spikes.
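The idea behind paged key-value caching can be sketched as a small block allocator: cache memory is split into fixed-size blocks, each sequence holds a table of the blocks it owns, and an evicted sequence returns its blocks to a shared free list. The class name PagedKVAllocator and its methods are hypothetical and only track block ownership. Real servers store tensors in these blocks and differ in detail.

```python
class PagedKVAllocator:
    """Toy bookkeeping for a paged KV cache (hypothetical, illustration only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size        # tokens that fit in one block
        self.free = list(range(num_blocks))  # ids of unowned blocks
        self.tables = {}                    # sequence id -> list of block ids
        self.lengths = {}                   # sequence id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        # A new block is needed when the sequence has none or its last is full.
        if n % self.block_size == 0:
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Evicting a sequence returns all of its blocks to the free list.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

if __name__ == "__main__":
    cache = PagedKVAllocator(num_blocks=4, block_size=2)
    for _ in range(3):
        cache.append_token("a")   # 3 tokens occupy 2 blocks
    cache.append_token("b")       # 1 token occupies 1 block
    cache.release("a")            # both of a's blocks become reusable at once
    print(len(cache.free))
```

Because blocks are handed out one at a time as a sequence grows, a sequence that finishes early frees memory immediately for whichever request the scheduler admits next.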
In-flight batching is one of several inference-serving optimizations a self-hoster combines when running models locally. See our entry on throughput-optimized serving for the broader latency-versus-throughput picture, and request scheduling for how the server decides which work to admit next.
