Definition
Constrained decoding, also called guided or structured decoding, is a method that guarantees a language model's output conforms to a predefined format such as valid JSON, a regular expression, or a context-free grammar. Instead of generating freely and hoping the result parses, the decoder restricts the model at every step so that only tokens which keep the output legal can be chosen. This turns "please respond in JSON" from a polite request that sometimes fails into a hard guarantee.
How the constraint is enforced
The target format is compiled into a state machine, typically a finite-state machine for regular grammars or a pushdown automaton for nested structures like JSON. At each generation step the automaton's current state defines exactly which next tokens are valid. The decoder applies a mask to the model's output probabilities, setting the probability of every illegal token to zero, then samples only from the survivors. Libraries such as Outlines and XGrammar precompute these state-to-token maps so the masking adds little overhead, and in some cases throughput actually rises because invalid retries are eliminated.
Why sovereign operators care
When a locally hosted model feeds a downstream program, an automation script, a database write, or a tool call, malformed output breaks the pipeline. Constrained decoding makes the model a reliable component: the output is always parseable, so no brittle text-scraping or repair logic is needed. It is the foundation of dependable function calling and agent workflows running entirely on one's own hardware.
Constrained decoding operates purely on the model's output logits at generation time, so it composes cleanly with throughput techniques like continuous batching and with the standard KV cache.
In Simple Terms
Constrained decoding, also called guided or structured decoding, is a method that guarantees a language model’s output conforms to a predefined format such as valid…
