Definition
Repetition penalty is a decoding control that discourages a language model from repeating tokens it has already produced or seen in the prompt. It directly attacks one of the most common failure modes of local inference: the degenerate loop, where a model gets stuck emitting the same phrase or sentence over and over. For anyone self-hosting a model for sovereignty reasons, it is one of the first knobs worth tuning when output quality degrades into echoes.
How it works
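Before each token is sampled, the penalty lowers the logits of tokens that have already appeared in the context. The classic repetition penalty is multiplicative: a positive logit is divided by the penalty factor, and common implementations multiply a negative logit by it instead, so the score always moves down. Values above 1.0 suppress repeats, 1.0 leaves the logits unchanged, and values below 1.0 encourage repeats. A frequency penalty, by contrast, subtracts a fixed amount per occurrence, and a presence penalty subtracts a fixed amount once. Because the classic penalty scales the raw logit rather than shifting it, it is sometimes described as exponential rather than linear. Its strength depends on the size of the logit and, in most implementations, not on how many times the token has occurred, so even a small factor can be a strong deterrent for high-scoring tokens.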
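The difference between the multiplicative and the subtractive penalty can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's code: the function names are hypothetical, the logits are a plain list indexed by token id, and the handling of negative logits (multiply instead of divide) is an assumption that matches several open-source implementations but may differ in the runtime you use.

```python
from typing import Iterable, List


def apply_repetition_penalty(
    logits: List[float], seen_token_ids: Iterable[int], penalty: float
) -> List[float]:
    """Multiplicative penalty, applied once per distinct seen token."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    out = list(logits)  # copy so the caller's logits are not mutated
    for tok in set(seen_token_ids):  # set(): repeat counts are ignored
        if out[tok] > 0:
            out[tok] /= penalty  # positive logit shrinks toward zero
        else:
            out[tok] *= penalty  # negative logit moves further below zero
    return out


def apply_frequency_penalty(
    logits: List[float], seen_token_ids: Iterable[int], alpha: float
) -> List[float]:
    """Subtractive penalty: alpha is subtracted once per occurrence."""
    out = list(logits)
    for tok in seen_token_ids:  # no set(): every occurrence counts
        out[tok] -= alpha
    return out


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5, 3.0]
    history = [0, 1, 0]  # token 0 appeared twice, token 1 once
    print(apply_repetition_penalty(logits, history, penalty=2.0))
    print(apply_frequency_penalty(logits, history, alpha=0.5))
```

In this sketch token 0 is penalized the same amount by the repetition penalty whether it appeared once or twice, while the frequency penalty subtracts 0.5 for each of its two occurrences.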
Tuning in practice
A common range is roughly 1.05 to 1.3. Set too low, the penalty leaves the model looping; set too high, it pushes the model away from tokens it legitimately needs, breaking grammar or refusing to reuse necessary technical terms. Code and structured output are especially sensitive, since braces, keywords, and field names must recur. Repetition penalty complements temperature and nucleus sampling rather than replacing them, and it is rarely used alone.
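To get a feel for how sensitive the setting is, it can help to watch what happens to the probability of one already-seen token as the penalty grows. The following is a toy sketch under stated assumptions: the three-token vocabulary and its logits are invented for illustration, the helper names are hypothetical, and negative logits are multiplied rather than divided, which may not match every runtime.

```python
import math
from typing import Dict, Iterable, List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (max-shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def penalize(logits: List[float], seen: Iterable[int], penalty: float) -> List[float]:
    """Multiplicative repetition penalty, once per distinct seen token."""
    out = list(logits)
    for tok in set(seen):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def sweep(
    logits: List[float], seen: Iterable[int], token_id: int, penalties: List[float]
) -> Dict[float, float]:
    """Probability of token_id after applying each penalty value."""
    seen = list(seen)  # allow any iterable to be reused across penalties
    return {p: softmax(penalize(logits, seen, p))[token_id] for p in penalties}


if __name__ == "__main__":
    toy_logits = [4.0, 3.5, 1.0]  # invented values; token 0 was already emitted
    for p, prob in sweep(toy_logits, [0], 0, [1.0, 1.05, 1.1, 1.2, 1.3]).items():
        print(f"penalty={p:<4} P(token 0)={prob:.3f}")
```

With real vocabularies the effect depends on how far the repeated token's logit sits above its competitors, so a sweep like this is better run against logits captured from the model actually being tuned.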
It belongs to the same family of sampling controls as Temperature Sampling and Top-p (Nucleus) Sampling, and is frequently paired with a Logit Bias for finer per-token control.
