Definition
Token budgeting and rate limiting are the controls that govern how much work a large language model endpoint will accept from a given caller over time. Unlike a traditional API where every request is roughly equal, an LLM call's cost scales with the number of tokens it processes, counting both the prompt and the generated output: a 50-token prompt and a 100,000-token prompt each count as one request but consume vastly different amounts of compute. LLM platforms therefore limit usage along a token dimension, most commonly tokens per minute (TPM), alongside the familiar requests per minute (RPM).
How the limits work
Rate limiters typically combine limits over more than one time window. A short per-minute TPM cap smooths bursts and protects the server from being overwhelmed, while a longer per-day or per-month token budget enforces an overall spending ceiling. When a caller exceeds the rate limit, the server returns an HTTP 429 (Too Many Requests), often with a Retry-After header saying how long to wait; when a hard quota is exhausted, it may return 403 (Forbidden), though the status code used for quota exhaustion varies between platforms. Well-behaved clients read these responses and back off, waiting before they retry, rather than hammering the endpoint.
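The two-window idea can be sketched in a few lines of Python. This is a minimal illustration, not any platform's actual implementation: the class name TokenLimiter, its parameters, and the choice of a token bucket for the per-minute cap are all assumptions made for the example. The daily reset is left out for brevity, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenLimiter:
    """Hypothetical limiter: per-minute token bucket plus a hard daily budget."""

    def __init__(self, tpm: int, daily_budget: int) -> None:
        self.tpm = tpm                    # bucket capacity, in tokens per minute
        self.daily_budget = daily_budget  # hard ceiling on tokens per day
        self.bucket = float(tpm)          # start with a full bucket
        self.used_today = 0               # resetting this each day is omitted
        self.last_refill = 0.0            # timestamp of the last refill, seconds

    def check(self, tokens: int, now: float) -> int:
        """Return the HTTP status a server might send for this request."""
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self.last_refill
        self.bucket = min(self.tpm, self.bucket + elapsed * self.tpm / 60.0)
        self.last_refill = now

        if self.used_today + tokens > self.daily_budget:
            return 403  # long-window quota exhausted
        if tokens > self.bucket:
            return 429  # short-window rate exceeded; caller should back off
        self.bucket -= tokens
        self.used_today += tokens
        return 200


limiter = TokenLimiter(tpm=1000, daily_budget=1500)
print(limiter.check(800, now=0.0))   # admitted
print(limiter.check(500, now=0.0))   # bucket too low: rate limited
print(limiter.check(500, now=30.0))  # half a minute of refill: admitted
print(limiter.check(300, now=90.0))  # would exceed the daily budget
```

A token bucket is only one way to enforce the short window; fixed or sliding windows are common alternatives and trade burst tolerance against simplicity.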
Why a self-hoster needs them
Even when you run the model on your own hardware with no per-token bill, token budgets matter. They stop a single runaway agent or misbehaving script from starving every other user of GPU time, they make latency predictable by bounding how much the queue can grow, and they form a first line of defence against abuse. In a multi-user sovereign deployment, per-user quotas are how you keep one tenant from monopolizing a shared GPU.
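As a rough sketch of per-user quotas on a shared GPU, the following keeps a separate token counter for each user and resets it at the start of every window. The class name PerUserQuota, the fixed-window scheme, and the 60-second default are assumptions chosen for illustration; a real deployment would also need persistence and locking, which are not shown.

```python
class PerUserQuota:
    """Hypothetical fixed-window token quota tracked separately per user."""

    def __init__(self, tokens_per_window: int, window_seconds: float = 60.0) -> None:
        self.limit = tokens_per_window
        self.window = window_seconds
        self._state = {}  # user -> (window index, tokens used in that window)

    def admit(self, user: str, tokens: int, now: float) -> bool:
        """Return True if the request fits in the user's current window."""
        index = int(now // self.window)
        seen_index, used = self._state.get(user, (index, 0))
        if seen_index != index:
            used = 0  # a new window has started, so the counter resets
        if used + tokens > self.limit:
            self._state[user] = (index, used)
            return False
        self._state[user] = (index, used + tokens)
        return True


quota = PerUserQuota(tokens_per_window=1000)
print(quota.admit("agent", 900, now=0.0))   # runaway agent uses most of its share
print(quota.admit("agent", 200, now=10.0))  # same user is now over quota
print(quota.admit("alice", 200, now=10.0))  # another user is unaffected
print(quota.admit("agent", 200, now=61.0))  # next window: agent is admitted again
```

Because each user has an independent counter, one tenant exhausting its share does not reduce what the others may submit, which is the isolation property described above.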
Token limits shape the work that reaches the scheduler; see request scheduling for how admitted requests are ordered, and throughput-optimized serving for how total capacity is set.
