Definition
Perplexity is one of the oldest and most fundamental metrics for evaluating language models. It measures how well a model predicts a sample of text, defined formally as the exponential of the average per-token cross-entropy loss. Intuitively, a perplexity of 10 means that, on average, the model is as uncertain as if it were choosing uniformly among ten equally likely next tokens. Lower perplexity therefore indicates a model that assigns higher probability to the actual text and is more confident and accurate in its predictions.
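The definition above can be written out as a formula. The notation is an assumption introduced here for illustration, not taken from the text: N is the number of tokens, x_i is the i-th token, and p is the probability the model assigns to a token given the preceding ones.

```latex
\mathrm{PPL}(x_1,\dots,x_N)
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \ln p\left(x_i \mid x_1,\dots,x_{i-1}\right) \right)
```

If the model assigned probability 1/10 to every token, each term inside the sum would be ln(1/10), the average negative log would be ln 10, and the perplexity would be exactly 10, matching the "ten equally likely tokens" reading.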
How it is computed
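Given a sequence, the model produces a probability for each token conditioned on the tokens before it. Cross-entropy is the average negative log of those probabilities, and perplexity is the base of that logarithm raised to the cross-entropy: e when natural logarithms are used, 2 when the cross-entropy is measured in bits. The resulting value is the same whichever base is chosen, as long as the logarithm and the exponentiation use the same one. Because the score depends on the test corpus, two models can only be compared on perplexity if they are evaluated on the same data with the same tokenizer; different vocabularies split the same text into different numbers of tokens, which makes raw perplexity numbers non-comparable across models.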
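The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation: the function name `perplexity` and its input, a list of the probabilities the model assigned to each actual token, are assumptions made for the example.

```python
import math


def perplexity(token_probs):
    """Return perplexity given the probability the model assigned to each actual token.

    `token_probs` is a hypothetical input: one probability in (0, 1] per token,
    each already conditioned on the tokens that came before it.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")

    # Cross-entropy: average negative natural log of the per-token probabilities.
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

    # Perplexity: the log base (e here) raised to the cross-entropy.
    return math.exp(cross_entropy)
```

In practice the per-token log-probabilities would come from the model's output on the evaluation text; passing a uniform probability of 0.1 for every token yields a perplexity of about 10.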
Uses and limits
Perplexity is an intrinsic metric: it scores the model's raw probability distribution rather than performance on a downstream task. This makes it cheap to compute, useful for tracking pretraining progress, and a sensitive indicator of how well a quantized or fine-tuned model has preserved its base quality. Its weakness is that low perplexity does not guarantee usefulness, correctness, or safety; a model can predict fluent text while being factually wrong, which is why task benchmarks and human evaluation remain necessary.
For sovereign operators running quantized models locally, perplexity is a standard way to check how much quality a compression setting costs: evaluate the original and the quantized model on the same text with the same tokenizer and compare the two scores. Pair it with task tests such as the MMLU benchmark and code-execution scoring like HumanEval for a more complete evaluation.
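One simple way to express the cost of a compression setting is the relative increase in perplexity over the uncompressed baseline. The sketch below assumes both scores were measured on the same text with the same tokenizer; the function name `perplexity_increase` is hypothetical, and what counts as an acceptable increase is a judgment call the text does not specify.

```python
def perplexity_increase(baseline_ppl, quantized_ppl):
    """Relative perplexity increase of a quantized model over its baseline.

    Both values are assumed to come from the same evaluation text and tokenizer.
    A result of 0.05 means the quantized model's perplexity is 5% higher.
    """
    if baseline_ppl <= 0 or quantized_ppl <= 0:
        raise ValueError("perplexity values must be positive")
    return (quantized_ppl - baseline_ppl) / baseline_ppl
```

For instance, with illustrative (not measured) scores of 10.0 for the baseline and 10.5 for the quantized model, the function returns approximately 0.05, a 5% increase.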
