Definition
Temperature is the sampling parameter that controls how random a language model's output is at inference time. After the model computes raw scores (logits) for every possible next token, it divides those logits by the temperature value before applying the softmax function that turns them into probabilities. The result reshapes the probability distribution the model samples from, trading determinism against variety with a single number.
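Written as a formula, the scaling step described above looks like the following. The symbols are notation chosen here for illustration and do not come from any particular library: z_i is the logit for token i, T is the temperature, and p_i is the resulting probability.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}, \qquad T > 0
```

At T = 1 this is the ordinary softmax. As T shrinks toward 0 the probability mass concentrates on the largest logit, and as T grows the distribution approaches uniform.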
How the math behaves
With temperature below 1, the distribution sharpens: high-probability tokens dominate even more, so output is focused and repeatable. With temperature above 1, the distribution flattens: unlikely tokens gain probability mass, so output is more diverse and surprising. At exactly 1, the model's probabilities are left unchanged. Because dividing by 0 is undefined, implementations typically treat a temperature of 0 as greedy decoding, always taking the single most probable token. That makes responses deterministic in principle, although small run-to-run differences can still come from hardware or batching effects. Typical values run from 0 to about 2.
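A minimal sketch of this behavior, using only the Python standard library. The function name softmax_with_temperature and the three example logits are hypothetical, and treating a temperature of 0 as greedy decoding is an assumption about how a sampler might special-case it, not a description of any specific library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after scaling by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Division by zero is undefined, so treat 0 as greedy decoding
        # (an assumption): all probability goes to the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running the loop shows the top token's share shrinking as the temperature rises, while the probabilities at each setting still sum to 1.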
Choosing a value
For factual, structured, or code-generation tasks, low temperature (0 to 0.3) reduces the chance of drift and keeps answers consistent. For brainstorming or creative writing, higher values (0.7 to 1.2) produce more varied output. Temperature interacts with other samplers such as top-p (nucleus sampling) and top-k, which prune the candidate set. Whether pruning happens before or after temperature scaling depends on the implementation, and for top-p the order can change which tokens survive.
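To make the interaction with top-p concrete, here is a hedged sketch of a sampler in which the order of the two steps can be switched. All names (softmax, top_p_indices, sample, temperature_first) and the example logits are hypothetical, and real inference stacks organise their sampler chains differently. It uses only the standard library and assumes a temperature greater than 0.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(probs, top_p):
    """Smallest set of highest-probability tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


def sample(logits, temperature, top_p, rng, temperature_first=True):
    """Sample one token index; assumes temperature > 0."""
    scaled = [z / temperature for z in logits]
    # Choose which distribution the nucleus cut-off is measured on.
    reference = scaled if temperature_first else logits
    kept = top_p_indices(softmax(reference), top_p)
    # Renormalise the temperature-scaled scores over the surviving tokens.
    probs = softmax([scaled[i] for i in kept])
    return rng.choices(kept, weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
print(top_p_indices(softmax(logits), 0.8))                    # cut-off on raw logits
print(top_p_indices(softmax([z / 2.0 for z in logits]), 0.8))  # cut-off after T = 2

rng = random.Random(0)  # fixed seed so repeated runs draw the same tokens
print([sample(logits, 2.0, 0.8, rng) for _ in range(5)])
```

With these numbers, measuring the cut-off on the raw logits keeps two tokens, while measuring it after scaling by a temperature of 2 keeps three, because the flatter distribution needs more tokens to reach the same cumulative mass. Top-k does not have this sensitivity, since dividing by a positive temperature never changes the ranking of the logits.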
Low temperature reduces randomness but does not guarantee truth: a confident wrong answer, that is, a hallucination, can still appear. When you self-host an open-weight model, you control temperature directly at the sampler, along with related settings such as the random seed, which gives you a degree of reproducibility that a locked-down vendor API may not expose.
