Request Scheduling (LLM Serving)

Request scheduling is the logic inside a large language model inference server that decides, at each step, which queued requests enter the running batch, which are preempted, and in what order prompts are processed. GPU memory and compute are finite; the scheduler is what turns a pile of concurrent requests into a stream of tokens that respects each user's latency expectations while keeping the hardware busy. It is the closest thing an inference server has to an operating system kernel, and it borrows many of the same ideas.

The metrics it balances

Two latencies dominate. Time to first token (TTFT) measures how long a user waits after sending a request before any output appears; it is driven mostly by queueing delay plus the prefill pass over the prompt. Time per output token (TPOT), sometimes called time between tokens, measures the pace of streaming once generation starts. The two trade against raw throughput: packing the GPU with as much work as possible maximizes tokens per second across all users, but can make any individual user's stream stutter. A scheduler tries to honour targets on both latencies while pushing system throughput as high as the hardware allows — and the right balance differs between an interactive chat box and a batch job summarizing a thousand documents overnight.
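As a rough illustration, both latencies can be computed from per-token timestamps. This is a minimal sketch: the `RequestTrace` record and its field names are hypothetical, not the logging format of any particular server, and TPOT is taken here as the average gap between consecutive output tokens.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request log: all times in seconds on one clock."""
    arrival: float            # when the server received the request
    token_times: list[float]  # when each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time to first token: queueing delay plus prefill, seen from the client."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Average time per output token after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(f"TTFT = {ttft(trace):.2f} s, TPOT = {tpot(trace):.2f} s")
```

In this made-up trace the user waits half a second for the first token and then sees a new token every quarter second.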

Common policies

Naive first-come-first-served suffers head-of-line blocking: one request with an enormous prompt stalls everyone behind it while its prefill monopolizes the GPU. Practical servers respond with classic systems tricks. Shortest-job-first ordering of prefills cuts average TTFT by letting quick requests slip past heavyweights. Least-slack-time-first protects deadlines when requests carry service-level objectives. Chunked prefill breaks a huge prompt into pieces so it interleaves with ongoing decodes instead of freezing them. Multi-priority schemes let latency-sensitive traffic jump ahead of best-effort background work, and preemption can evict a running sequence when something more urgent arrives, either swapping its key-value cache out of VRAM to host memory or discarding it and recomputing it later. All of this operates hand in glove with in-flight batching (also called continuous batching), which supplies the per-step admission points the scheduler exploits.
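Two of these policies are easy to sketch with a toy model. The code below is an illustration under simplifying assumptions, not the algorithm of any real server: all requests arrive at the same moment, prefills run one at a time, and each step has a fixed token budget. The names `mean_ttft`, `Seq`, `plan_step`, and `token_budget` are hypothetical.

```python
from dataclasses import dataclass
from itertools import accumulate


def mean_ttft(prefill_seconds):
    """Mean TTFT when prefills run back to back in the given order.

    Assumes every request arrived at t=0, so each request's TTFT is the
    running sum of prefill times up to and including its own.
    """
    finish_times = list(accumulate(prefill_seconds))
    return sum(finish_times) / len(finish_times)


@dataclass
class Seq:
    name: str
    prompt_left: int  # prompt tokens still to prefill; 0 means decoding


def plan_step(seqs, token_budget):
    """Plan one step: decodes first, leftover budget goes to prefill chunks."""
    plan = []
    budget = token_budget
    # Every decoding sequence gets its one token, so streams never freeze.
    for s in seqs:
        if s.prompt_left == 0 and budget > 0:
            plan.append((s.name, 1))
            budget -= 1
    # Whatever budget remains is spent on prompt chunks, in queue order.
    for s in seqs:
        if s.prompt_left > 0 and budget > 0:
            chunk = min(s.prompt_left, budget)
            s.prompt_left -= chunk
            budget -= chunk
            plan.append((s.name, chunk))
    return plan


# Head-of-line blocking: one 10 s prefill ahead of two 1 s prefills.
arrivals = [10.0, 1.0, 1.0]
print("FCFS mean TTFT:", mean_ttft(arrivals))          # 11.0
print("SJF  mean TTFT:", mean_ttft(sorted(arrivals)))  # 5.0

# Chunked prefill: a 1000-token prompt shares steps with two live chats.
big = Seq("big-doc", 1000)
seqs = [Seq("chat-a", 0), Seq("chat-b", 0), big]
steps = 0
while big.prompt_left > 0:
    print(plan_step(seqs, token_budget=256))
    steps += 1
print("prefill steps:", steps)  # 4
```

With a 256-token budget, the long prompt is absorbed in four steps of at most 254 tokens each, and both chats still receive a token in every one of those steps.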

Why a self-hoster should care

On owned hardware, the scheduler is the lever that converts one GPU into a predictable, fair service. A home-lab box serving a local LLM to a family, a workshop, or a fleet of agents will feel radically different depending on scheduler settings: cap concurrent sequences too high and every chat crawls as decode slots contend; too low and the queue builds while the GPU idles. Sensible starting points are to separate interactive from batch traffic by priority, enable chunked prefill if your server supports it, and watch TTFT under real load rather than benchmarks. The scheduler is also where global limits are enforced — see token budget and rate limiting for how operators keep any single user or runaway agent from starving everyone else. Sovereignty means the queue discipline is yours to set, which is exactly why it pays to understand it.
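Watching TTFT under real load usually means looking at percentiles rather than the average, because a few badly queued requests are exactly what users notice. The following is a small sketch using the nearest-rank method; the sample values are invented for illustration and the `percentile` helper is hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up TTFT measurements in seconds; one request got stuck in the queue.
ttft_samples = [0.2, 0.3, 0.25, 0.4, 3.1, 0.35, 0.3, 0.28, 0.33, 0.45]

print("mean:", sum(ttft_samples) / len(ttft_samples))
print("p50: ", percentile(ttft_samples, 50))
print("p95: ", percentile(ttft_samples, 95))
```

Here the median looks healthy while the 95th percentile exposes the stalled request, which is the signal to revisit concurrency caps or priorities.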

Fairness and starvation

Every clever policy creates a new failure mode, and the classic one is starvation. Pure shortest-job-first minimizes average waiting time on paper and is cruel in practice: a user with a legitimately large prompt can wait indefinitely while a stream of quick requests perpetually jumps the queue. Production schedulers temper this with aging, in which a request's effective priority rises the longer it waits, so heavyweights are delayed but never abandoned. Per-user fairness is the other half: without it, one aggressive client (often a runaway script or an over-eager agent loop) can occupy every decode slot while everyone else stalls. Round-robin admission across users, per-identity concurrency caps, and separate queues for interactive versus batch traffic are the standard remedies. On a shared home-lab box these problems arrive surprisingly early: the first time an overnight summarization job makes the morning chat unusable, you are writing scheduling policy whether you meant to or not.
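Starvation and aging can be shown with a toy simulation. This sketch assumes one admission slot per tick, one new short request arriving every tick, and a linear aging rule (estimated cost minus `aging_rate` times waiting time). The function name, the costs, and the linear form are all illustrative assumptions; real schedulers may age priorities differently.

```python
def admit_tick_of_big(aging_rate, horizon=100):
    """Return the tick at which the big request is admitted, or None.

    The queue holds (name, estimated_cost, arrival_tick) tuples. Each tick
    one short request arrives and the request with the lowest effective
    key is admitted, where key = cost - aging_rate * time_waited.
    """
    queue = [("big", 1000, 0)]
    for now in range(horizon):
        queue.append((f"short-{now}", 10, now))
        chosen = min(queue, key=lambda r: r[1] - aging_rate * (now - r[2]))
        queue.remove(chosen)
        if chosen[0] == "big":
            return now
    return None  # starved for the whole horizon


print("pure SJF:  ", admit_tick_of_big(aging_rate=0))   # None
print("with aging:", admit_tick_of_big(aging_rate=50))  # 20
```

Without aging the large request is never admitted within the horizon, because a fresh short request always has the lower key. With aging its key falls by 50 per tick and drops below a new short request's key of 10 at tick 20, so it is delayed but eventually served.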
