

In-Flight Batching

Sovereign AI

Definition

In-flight batching is a request-scheduling technique used by self-hosted LLM inference servers to keep a GPU busy while generating text. In a naive server, a batch of requests starts together and the GPU waits for the slowest one to finish before accepting any new work. In-flight batching (the term NVIDIA's TensorRT-LLM uses; vLLM and others call the closely related idea continuous batching) instead manages the batch at the granularity of a single generation step. After each step, in which every active sequence produces one token, any sequence that has finished is evicted from the batch and a waiting request is slotted into the vacated space.
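The step-level loop described above can be sketched as a toy simulation. This is an illustrative model, not the scheduler of TensorRT-LLM or vLLM: the function name `in_flight_batching` and its parameters are hypothetical, each request is reduced to the number of tokens it will generate (assumed to be at least 1), and prefill cost is ignored.

```python
from collections import deque

def in_flight_batching(lengths, max_batch):
    """Toy step-level scheduler.

    lengths   -- tokens each request will generate, in arrival order (each >= 1)
    max_batch -- number of sequences the batch can hold at once
    Returns (total_steps, finish) where finish maps request id -> finishing step.
    """
    waiting = deque(enumerate(lengths))  # (request_id, tokens_to_generate)
    running = {}                         # request_id -> tokens still to generate
    finish = {}
    step = 0
    while waiting or running:
        # Refill vacated slots from the queue before the next step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1
        # One generation step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                # Finished sequences leave the batch immediately.
                del running[rid]
                finish[rid] = step
    return step, finish
```

With `in_flight_batching([2, 5, 1, 3], max_batch=2)`, the third request is admitted as soon as the first one finishes at step 2, rather than waiting for the five-token request sharing the batch.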

Why it matters for self-hosting

Generated sequences differ widely in length. Without in-flight batching, a batch of ten requests where nine finish quickly stalls until the tenth long answer completes, and most of the GPU's batch capacity sits unused in the meantime. Evicting and refilling per step removes that head-of-line blocking, so throughput climbs and the average time a request waits in the queue falls. For a sovereign operator running inference on their own hardware, this can be the difference between a GPU that sits at 30% utilization and one that stays saturated.
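The head-of-line blocking effect can be made concrete with a small back-of-the-envelope comparison. The sketch below rests on simplifying assumptions: every step costs the same regardless of how many slots are filled, prefill is ignored, requests are admitted in arrival order, and "utilization" means the fraction of batch slots doing useful work. The function names and the workload are made up for illustration and are not measurements.

```python
import heapq

def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps

def in_flight_steps(lengths, max_batch):
    """Each request takes the slot that frees up earliest."""
    slots = [0] * max_batch          # step at which each slot becomes free
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + n)
    return max(slots)

def utilization(lengths, max_batch, steps):
    """Fraction of slot-steps spent generating tokens."""
    return sum(lengths) / (steps * max_batch)

# Hypothetical workload: two waves of nine short answers and one long one.
workload = ([10] * 9 + [100]) * 2

static = static_batching_steps(workload, max_batch=10)   # 200 steps
in_flight = in_flight_steps(workload, max_batch=10)      # 120 steps
static_util = utilization(workload, 10, static)          # 0.19
in_flight_util = utilization(workload, 10, in_flight)    # about 0.32
```

In this toy workload the second long request starts at step 21 instead of step 101, which is where the saving comes from. Real gains depend on the length distribution and on how steadily requests arrive.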

The trade-offs

Because requests at different stages of generation coexist in the same batch, in-flight batching is usually paired with paged key-value caching so that memory for each sequence can be allocated and freed independently. It also interleaves the compute-heavy prefill phase of newly admitted requests with the memory-bound decode phase of requests already running. The scheduler must balance the two, because a long prefill sharing a step with ongoing decodes can cause latency spikes for users already mid-response.
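To illustrate why paging helps, here is a minimal sketch of a block-based key-value cache allocator. It only tracks which fixed-size blocks belong to which sequence and stores no tensors. The class `PagedKVCache`, its methods, and the block sizes are hypothetical and do not mirror the internals of any particular server.

```python
class PagedKVCache:
    """Toy bookkeeping for a paged KV cache (no actual tensors)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size              # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # seq_id -> list of block ids
        self.lengths = {}                         # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; return False if the pool is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:              # no block yet, or last one is full
            if not self.free_blocks:
                return False
            block = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block)
        self.lengths[seq_id] = n + 1
        return True

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence grows one block at a time and releases all of its blocks the moment it finishes, a newly admitted request can reuse that memory on the next step without waiting for the rest of the batch.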

In-flight batching is one of several inference-serving optimizations a self-hoster combines when running models locally. See our entry on throughput-optimized serving for the broader latency-versus-throughput picture, and request scheduling for how the server decides which work to admit next.

