Batch Size (Inference)

Batch size in LLM inference is the number of requests or sequences the model processes together in a single forward pass. Because the decode phase is memory-bandwidth-bound, a lone request rarely uses all of a GPU's compute: the hardware spends much of its time waiting on memory rather than doing math. Grouping several requests into one batch reuses each weight load across many sequences, so total throughput (tokens per second summed across all requests) rises roughly in proportion to batch size at first, then flattens at a saturation point where the GPU finally becomes compute-bound.
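As a rough illustration of that saturation point, the sketch below uses a roofline-style model: each decode step takes as long as the slower of reading the weights once and doing the arithmetic for the whole batch. The function names and all hardware and model figures are assumed round numbers, not measurements of any particular card, and the model ignores KV-cache reads and kernel overheads, so treat the output as a shape rather than a prediction.

```python
def decode_step_seconds(batch_size, weight_bytes, mem_bandwidth,
                        flops_per_token, peak_flops):
    """Estimate one decode step as the slower of memory traffic and compute."""
    # Weights are read once per step and shared by every sequence in the batch.
    memory_time = weight_bytes / mem_bandwidth
    # Arithmetic grows with the number of sequences decoded in the step.
    compute_time = batch_size * flops_per_token / peak_flops
    return max(memory_time, compute_time)


def aggregate_tokens_per_second(batch_size, **hw):
    """Tokens per second summed across the batch (one token per sequence per step)."""
    return batch_size / decode_step_seconds(batch_size, **hw)


def saturation_batch(weight_bytes, mem_bandwidth, flops_per_token, peak_flops):
    """Batch size at which compute time catches up with memory time."""
    return (weight_bytes / mem_bandwidth) * (peak_flops / flops_per_token)


# Assumed, illustrative figures (not a real GPU or model).
hw = dict(
    weight_bytes=14e9,     # about 7B parameters stored at 16 bits
    mem_bandwidth=1.0e12,  # 1 TB/s
    flops_per_token=14e9,  # about 2 FLOPs per parameter per token
    peak_flops=100e12,     # 100 TFLOP/s
)

for batch in (1, 8, 64, 256):
    print(batch, round(aggregate_tokens_per_second(batch, **hw)))
print("saturates near batch", round(saturation_batch(**hw)))
```

Under these assumptions, throughput scales with batch size up to roughly 100 sequences and stays flat beyond that; real hardware bends earlier and more gradually.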

The cost of bigger batches

Throughput gains are not free. Every sequence in a batch needs its own key-value (KV) cache, so memory consumption grows with both batch size and sequence length, and eventually becomes the hard limit on how many requests fit alongside the model weights in VRAM. Larger batches also raise per-request latency, since each request shares compute and bandwidth with its neighbors: the tokens-per-second an individual user experiences drops even as the aggregate rises. Beyond a certain size the returns diminish and latency degrades faster than throughput improves, so there is a sweet spot rather than a "bigger is always better" rule. Long contexts make the squeeze worse: a batch of eight requests, each carrying a long context, can demand more KV-cache memory than the model weights themselves.
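The memory side of that trade-off can be estimated with simple arithmetic. The sketch below assumes a standard attention layout in which each token stores one key and one value vector per KV head per layer; the layer count, head count, head dimension, VRAM size, and weight size are hypothetical values chosen for illustration. It also assumes every sequence reserves its full context up front, which is pessimistic for engines that allocate cache on demand.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: a key and a value per head per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_batch_size(vram_bytes, weight_bytes, context_tokens, per_token_bytes):
    """Largest number of full-context sequences that fit next to the weights."""
    free = vram_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // (context_tokens * per_token_bytes)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
per_sequence = per_token * 32_768  # one sequence at a 32k-token context

print(per_token // 1024, "KiB per token")
print(per_sequence / GIB, "GiB per 32k-token sequence")
print(8 * per_sequence / GIB, "GiB for a batch of eight")

# Hypothetical card: 80 GiB of VRAM holding 16 GiB of weights.
print(max_batch_size(80 * GIB, 16 * GIB, 32_768, per_token), "sequences at 32k")
print(max_batch_size(80 * GIB, 16 * GIB, 8_192, per_token), "sequences at 8k")
```

With these assumed numbers a batch of eight long-context requests needs twice the memory of the weights, and cutting the context to a quarter lets four times as many sequences fit.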

Static vs continuous batching

Naive static batching fixes the group at launch and wastes capacity when sequences finish at different times, leaving GPU slots idle until the longest sequence completes. Continuous batching swaps completed requests out and waiting ones in after every decoding step, keeping the batch full and lifting utilization substantially. It is the standard approach in modern serving engines and the main reason a well-configured server can serve many users from one card.
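A toy simulation makes the difference concrete. The sketch below is not how any particular serving engine schedules work; it only counts decode steps for a made-up list of output lengths, assuming every request is already queued, each active sequence produces one token per step, and prefill cost is ignored.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Run fixed groups; each group occupies the GPU until its longest member ends."""
    steps = 0
    for start in range(0, len(lengths), slots):
        steps += max(lengths[start:start + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Refill any free slot from the queue before every decode step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


def utilization(lengths, slots, steps):
    """Fraction of slot-steps that actually produced a token."""
    return sum(lengths) / (steps * slots)


# Made-up workload: two long answers mixed with six short ones, four slots.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static = static_batching_steps(lengths, slots=4)
continuous = continuous_batching_steps(lengths, slots=4)
print("static:", static, "steps,", round(utilization(lengths, 4, static), 3))
print("continuous:", continuous, "steps,", round(utilization(lengths, 4, continuous), 3))
```

In this workload the short requests no longer wait behind the long ones, so the same tokens are produced in 110 steps instead of 200.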

Prefill, decode, and why batching helps one more than the other

Batching pays out differently across the two phases of a request. The prefill phase, which ingests the prompt, processes many tokens in parallel and is already compute-heavy even for one request, so batching adds less there. The decode phase generates one token per sequence per step, reading every weight to produce a sliver of output; this is the memory-bound regime where batching shines, because eight sequences decode for nearly the price of one. That split explains the classic serving symptom: time-to-first-token stretches as batches deepen, because new prefills queue for compute, while steady-state generation stays healthy. It also explains why mixed workloads are tricky: one user pasting a huge document can stall the first token for everyone sharing the batch. Serving engines mitigate this with chunked prefill and scheduling policies, but the physics is worth knowing before you chase a config fix: a batch is a shared pipe, and both phases have to fit through it.

What it means for a self-hosted stack

For a sovereign operator running local inference, batch size is the main dial for trading responsiveness against total volume on fixed GPU memory. A single-user desktop setup running llama.cpp or a similar runtime effectively lives at batch size one, where memory bandwidth, not raw compute, decides your tokens per second; this is why a modest GPU with fast memory can feel quicker in chat than a bigger, slower-memory card. A homelab box serving a family or a small team benefits enormously from batching: the second, third, and fourth simultaneous requests add little to each step's duration because the weights are being read from memory anyway. The tuning workflow is empirical: start from the largest batch whose KV cache fits in the VRAM left over after the weights, measure per-request latency, and back off until interactive use feels right. Quantized weights (see LLM quantization) free up memory that can go directly into batch headroom, which is one of the quieter reasons 4-bit models are popular for multi-user self-hosting. See throughput vs latency and the decode phase it most affects.
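The back-off step of that workflow can be sketched as a small helper. Everything here is hypothetical: `pick_batch_size` is an illustrative name, the latency budget is an arbitrary choice, and the latency table holds placeholder numbers standing in for measurements you would take on your own hardware with your own runtime.

```python
def pick_batch_size(candidates, measure_latency_ms, latency_budget_ms):
    """Return the largest candidate whose measured latency fits the budget.

    measure_latency_ms is any callable that benchmarks one batch size and
    returns per-token latency in milliseconds. Returns None if nothing fits.
    """
    for batch in sorted(candidates, reverse=True):
        if measure_latency_ms(batch) <= latency_budget_ms:
            return batch
    return None


# Placeholder numbers, NOT real measurements: replace with your own benchmark.
placeholder_latency_ms = {1: 14, 2: 15, 4: 17, 8: 22, 16: 35, 32: 60}


def fake_benchmark(batch):
    """Stand-in for a real benchmark run against the local server."""
    return placeholder_latency_ms[batch]


# Candidates are batch sizes already known to fit in VRAM.
chosen = pick_batch_size(placeholder_latency_ms.keys(), fake_benchmark,
                         latency_budget_ms=40)
print("largest batch within budget:", chosen)
```

Passing the benchmark in as a callable keeps the selection logic independent of the runtime, so the same helper works whether the measurement comes from a script, a load-testing tool, or a stopwatch.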

See the throughput trade-off in the inference cost calculator.
