Definition
Request scheduling is the logic inside a large language model inference server that decides, at each step, which queued requests enter the running batch, which are preempted, and in what order their prompts are processed. Because GPU memory and compute are finite, the scheduler is what turns a pile of incoming requests into a stream of tokens that respects each user's latency expectations while keeping the hardware busy.
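As a rough illustration of one scheduling step, the sketch below admits queued requests in arrival order while they still fit in a memory budget. The names `Request`, `schedule_step`, and `memory_budget` are hypothetical and chosen for this example; real servers track memory in finer-grained units and also handle preemption, which is omitted here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_tokens: int  # tokens that must be held in memory once admitted


def schedule_step(waiting: deque, running: list, memory_budget: int) -> list:
    """Admit queued requests in arrival order while they fit in the budget.

    Assumption: memory use is approximated by the prompt token count.
    """
    used = sum(r.prompt_tokens for r in running)
    admitted = []
    # Stop at the first request that does not fit (strict arrival order).
    while waiting and used + waiting[0].prompt_tokens <= memory_budget:
        req = waiting.popleft()
        running.append(req)
        admitted.append(req.rid)
        used += req.prompt_tokens
    return admitted


waiting = deque([Request("a", 400), Request("b", 500), Request("c", 300)])
running = [Request("r0", 200)]
admitted = schedule_step(waiting, running, memory_budget=1200)
```

With these example numbers, requests "a" and "b" fit alongside the running request, while "c" stays queued until memory is freed.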
The metrics it balances
Two latencies dominate. Time to first token (TTFT) measures how long a user waits after sending a request before any output appears, and is driven mostly by queueing delay and the prefill of the prompt. Time per output token (TPOT), sometimes called time between tokens, measures the pace of streaming once generation starts. A scheduler tries to honour service-level objectives on both while pushing system throughput as high as possible.
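Both latencies can be computed from per-request timestamps. The following is a minimal sketch, assuming the server records when the request was sent and when each output token was emitted; the function names and sample timestamps are illustrative only.

```python
from __future__ import annotations


def ttft(sent_at: float, token_times: list[float]) -> float:
    """Time to first token: wait between sending and the first output token."""
    return token_times[0] - sent_at


def tpot(token_times: list[float]) -> float:
    """Time per output token: mean gap between consecutive output tokens."""
    if len(token_times) < 2:
        return 0.0  # a single token has no inter-token gap
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Hypothetical timestamps in seconds.
sent_at = 10.0
token_times = [10.5, 10.75, 11.0, 11.25, 11.5]

first_token_latency = ttft(sent_at, token_times)
per_token_latency = tpot(token_times)
```

Here the first token arrives 0.5 s after the request is sent, and the remaining tokens stream at 0.25 s each.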
Common policies
Naive first-come-first-served scheduling suffers head-of-line blocking, where one long prompt stalls everyone behind it. Practical servers borrow ideas from operating systems: shortest-job-first ordering of prefills to cut average TTFT, least-slack-time-first to protect deadlines, and chunked prefill that breaks a huge prompt into pieces so it interleaves with ongoing decodes rather than monopolizing the GPU. Multi-priority schemes let latency-sensitive requests jump ahead of best-effort background jobs.
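Two of these ideas can be shown with a small sketch. It assumes all requests are already queued at time zero, that prefills run one after another, and that TTFT is approximated by prefill completion time; the helper names and the cost figures are hypothetical.

```python
def average_ttft(prefill_costs):
    """Mean completion time when prefills run back to back in the given order."""
    clock = 0.0
    total = 0.0
    for cost in prefill_costs:
        clock += cost   # this prefill finishes at the running clock value
        total += clock
    return total / len(prefill_costs)


def chunk_prefill(prompt_tokens, chunk_size):
    """Split a prompt into chunk lengths of at most chunk_size tokens."""
    return [
        min(chunk_size, prompt_tokens - start)
        for start in range(0, prompt_tokens, chunk_size)
    ]


# One long prompt arrives ahead of two short ones (costs in seconds).
costs = [8.0, 1.0, 1.0]
fcfs = average_ttft(costs)          # arrival order: finishes at 8, 9, 10
sjf = average_ttft(sorted(costs))   # shortest first: finishes at 1, 2, 10

# A 1000-token prompt processed in chunks of at most 512 tokens.
chunks = chunk_prefill(1000, 512)
```

In this example the average drops from 9.0 s under first-come-first-served to about 4.33 s under shortest-job-first, because the short prompts no longer wait behind the long one. The chunk list is what a scheduler could interleave with decode steps instead of running the whole prefill at once.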
For a sovereign operator hosting models on their own hardware, the scheduler is the lever that converts raw GPU capacity into a predictable, fair service. It works hand in hand with in-flight batching and is constrained by the limits that token budgets and rate limiting impose.
