Definition
Batch size in LLM inference is the number of requests or sequences the model processes together in a single forward pass. Because the decode phase is memory-bandwidth-bound, a lone request rarely uses all of a GPU's compute: generating each token means streaming the model weights from memory, and the arithmetic units spend much of that time waiting. Grouping several requests into one batch reuses the same weight loads across many sequences, so total throughput (tokens per second across all requests) climbs sharply as batch size grows, until the GPU becomes compute-bound and throughput saturates.
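The saturation behavior can be illustrated with a rough roofline-style estimate. This is a minimal sketch, not a benchmark: the model size, memory bandwidth and peak FLOP rate are hypothetical round numbers, the cost per token is approximated as 2 FLOPs per parameter, and KV-cache reads are ignored.

```python
def decode_throughput(batch_size, n_params=7e9, bytes_per_param=2,
                      mem_bandwidth=1.0e12, peak_flops=1.0e14):
    """Estimate decode tokens/second for one batch (toy roofline model).

    All default values are illustrative assumptions, not measurements.
    """
    # Weights are read once per decode step, regardless of batch size.
    t_memory = n_params * bytes_per_param / mem_bandwidth
    # Compute grows with batch size (about 2 FLOPs per parameter per token).
    t_compute = batch_size * 2 * n_params / peak_flops
    # The step takes as long as the slower of the two.
    step_time = max(t_memory, t_compute)
    # One token is produced per sequence per step.
    return batch_size / step_time


for b in (1, 8, 64, 100, 200):
    print(f"batch={b:4d}  ~{decode_throughput(b):8.0f} tokens/s")
```

Under these assumed numbers, throughput scales almost linearly while the step is memory-bound and flattens once the compute time per step overtakes the weight-loading time.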
The cost of bigger batches
Throughput gains are not free. Every sequence in a batch needs its own key-value (KV) cache, so memory use grows roughly linearly with both batch size and sequence length, and eventually becomes the limit on how many requests fit. Larger batches also raise per-request latency, since each request shares compute and bandwidth with its neighbors. Beyond a certain size the returns diminish and latency degrades faster than throughput improves, so there is a sweet spot rather than a "bigger is always better" rule.
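To make the memory limit concrete, here is a back-of-the-envelope sketch of how many sequences fit in a given amount of GPU memory. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit cache) and the memory figures are hypothetical, and the estimate ignores activations, fragmentation and framework overhead.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (keys + values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_batch_size(gpu_bytes, weight_bytes, seq_len, per_token_bytes):
    """Largest number of full-length sequences whose KV cache fits."""
    free_bytes = gpu_bytes - weight_bytes
    per_sequence_bytes = per_token_bytes * seq_len
    return max(0, free_bytes // per_sequence_bytes)


GIB = 2**30
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
fits = max_batch_size(gpu_bytes=24 * GIB, weight_bytes=13 * GIB,
                      seq_len=4096, per_token_bytes=per_token)
print(per_token, "bytes per token;", fits, "sequences of 4096 tokens fit")
```

With these assumed values each token costs 0.5 MiB of cache, a 4096-token sequence costs 2 GiB, and only five such sequences fit in the 11 GiB left after the weights.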
Static vs continuous batching
Naive static batching fixes the group at launch and wastes capacity when sequences finish at different times: slots freed by short sequences sit idle until the longest sequence in the batch completes. Continuous batching swaps completed requests out and new ones in at the token level, keeping the batch full and lifting utilization. It is the standard approach in modern serving engines.
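The difference can be seen in a toy simulation that counts decode steps for the same workload under both policies. This is a simplified sketch, not how any particular serving engine is implemented: the function names and output lengths are made up for illustration, each request is reduced to the number of tokens it must generate, and prefill cost is ignored.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each fixed group runs until its longest member ends."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop those that finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 4]  # hypothetical output lengths in tokens
slots = 4
for name, fn in (("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)):
    steps = fn(lengths, slots)
    utilization = sum(lengths) / (steps * slots)
    print(f"{name:10s} steps={steps:2d} slot utilization={utilization:.0%}")
```

For this workload the static policy needs 12 steps and the continuous policy needs 9, because short requests no longer hold a slot open while the 8-token requests finish.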
For sovereign Bitcoiners self-hosting inference, batch size is the main dial for trading responsiveness against total volume on fixed GPU memory. See throughput vs latency and the decode phase it most affects.
See the throughput trade-off in the inference cost calculator.
