Definition
Continuous batching is a scheduling technique for serving large language models that increases the number of requests a GPU can process concurrently. Traditional (static) batching waits for every request in a batch to finish before starting the next batch, which wastes GPU time: requests generate outputs of different lengths, so the slots of finished requests sit idle until the longest one completes. Continuous batching instead makes its scheduling decisions at each generation step (hence the alternative name, iteration-level scheduling), retiring completed requests immediately and admitting new ones into the active batch as soon as a slot frees up. As a result, the GPU rarely sits idle.
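The difference between the two policies can be sketched with a toy simulation. This is a minimal illustration, not how any particular inference server is implemented: it assumes each request needs a known number of output tokens (at least one), that every in-flight request produces exactly one token per step, and it ignores prompt processing and memory limits. The function names and the `max_batch_size` parameter are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, max_batch_size):
    """Decode steps needed when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), max_batch_size):
        batch = output_lengths[start:start + max_batch_size]
        steps += max(batch)  # the whole batch waits for the longest request
    return steps


def continuous_batching_steps(output_lengths, max_batch_size):
    """Decode steps needed when slots are refilled at every generation step."""
    waiting = deque(output_lengths)  # tokens still to generate, per queued request
    active = []                      # tokens still to generate, per in-flight request
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots before this step.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        # One generation step: every in-flight request produces one token.
        active = [remaining - 1 for remaining in active]
        steps += 1
        # Retire finished requests immediately, freeing their slots.
        active = [remaining for remaining in active if remaining > 0]
    return steps


lengths = [2, 8, 3, 5]  # hypothetical output lengths, in tokens
print(static_batching_steps(lengths, max_batch_size=2))      # 13
print(continuous_batching_steps(lengths, max_batch_size=2))  # 10
```

In this example the same four requests finish in 10 steps instead of 13, because the slot freed by the two-token request is reused straight away rather than idling until its batch-mate finishes.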
Why throughput jumps
Autoregressive generation produces one token per step, and serving a single request at a time leaves a modern GPU underutilised because each decoding step is typically limited by memory bandwidth rather than compute. By packing in-flight requests together at every step, continuous batching keeps GPU occupancy high during decoding. Serving engines built around it report large throughput gains, on the order of 10-20x over naive batching in published benchmarks, while also lowering tail latency under bursty load.
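To get a rough feel for how much capacity static batching can leave unused, the sketch below computes the fraction of batch slots that do useful work in a single static batch. It is a simplified counting model, not a measurement of real GPU utilisation: it assumes one token per request per step, and the function name and example lengths are hypothetical.

```python
def static_batch_slot_utilisation(output_lengths):
    """Fraction of (slot, step) pairs that produce a token in one static batch."""
    longest = max(output_lengths)            # the batch runs this many steps
    useful = sum(output_lengths)             # tokens actually generated
    capacity = len(output_lengths) * longest # slots available over the whole run
    return useful / capacity


# Hypothetical batch: three short answers stuck behind one long one.
print(static_batch_slot_utilisation([10, 50, 200, 1000]))
```

Here only about a third of the available slot-steps generate a token; the rest is idle time spent waiting for the longest request. Continuous batching targets exactly that idle portion by handing freed slots to waiting requests.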
Relevance to self-hosting
If you run a local model that serves more than one user or application (a household, a small team, or several agents), continuous batching is what lets a single GPU keep up without buying more hardware. It is a core feature of popular open-source inference servers and usually works without further tuning once enabled. This is the practical, modern form of what is loosely called 'batch inference.'
It composes with other serving optimisations such as flash attention and speculative decoding, all of which aim at faster, more efficient local inference.
