Definition
MMLU (Massive Multitask Language Understanding) is a benchmark introduced by Dan Hendrycks and colleagues in 2020 to measure how much world knowledge a large language model has absorbed and how well it can reason across many domains. It comprises 15,908 four-option multiple-choice questions spanning 57 subjects, from elementary mathematics, US history, and computer science to professional-level law, medicine, and moral philosophy. Difficulty runs from elementary level through high school and college to professional exams, so a high score requires both broad recall and the ability to apply that knowledge.
How scoring works
A model is shown a question and four answer choices and must select the correct option. The headline number is plain accuracy: the share of questions answered correctly, usually reported as a single percentage. Reports can differ on whether that figure is averaged over all questions or over the 57 per-subject scores, so it is worth checking which convention a given number uses. Because random guessing yields 25%, scores near that floor indicate little real understanding, while top contemporary models exceed 85%. Results are commonly reported in a few-shot setting (typically five-shot), where the model first sees a handful of solved example questions before answering.
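The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the `Item` class, the prompt layout, and the function names `build_prompt` and `accuracy_report` are assumptions made for this example, and the sample questions are made up rather than taken from MMLU. It shows how a few-shot prompt might be assembled and how the two common averaging conventions can give different headline numbers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

CHOICES = "ABCD"  # four options per question, so random guessing scores 25%


@dataclass
class Item:
    """One multiple-choice question (hypothetical structure for this sketch)."""
    subject: str
    question: str
    options: Tuple[str, str, str, str]
    answer: str  # gold label, one of "A".."D"


def format_item(item: Item, with_answer: bool) -> str:
    """Render a question; solved few-shot examples include the answer letter."""
    lines = [item.question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, item.options)]
    lines.append(f"Answer: {item.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(examples: Sequence[Item], target: Item) -> str:
    """Few-shot prompt: solved examples first, then the unanswered target."""
    blocks = [format_item(ex, True) for ex in examples]
    blocks.append(format_item(target, False))
    return "\n\n".join(blocks)


def accuracy_report(
    items: Sequence[Item], predictions: Sequence[str]
) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, micro average, macro average)."""
    per_subject: Dict[str, List[bool]] = defaultdict(list)
    for item, pred in zip(items, predictions):
        per_subject[item.subject].append(pred == item.answer)
    subject_acc = {s: sum(v) / len(v) for s, v in per_subject.items()}
    total = sum(len(v) for v in per_subject.values())
    micro = sum(sum(v) for v in per_subject.values()) / total  # over questions
    macro = sum(subject_acc.values()) / len(subject_acc)       # over subjects
    return subject_acc, micro, macro


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    items = [
        Item("math", "2 + 2 = ?", ("3", "4", "5", "6"), "B"),
        Item("math", "3 * 3 = ?", ("9", "6", "3", "12"), "A"),
        Item("math", "10 / 2 = ?", ("2", "4", "5", "8"), "C"),
        Item("law", "A contract requires?", ("Offer", "Tax", "Jury", "Bail"), "A"),
    ]
    predictions = ["B", "A", "C", "D"]  # stand-in for model outputs
    _, micro, macro = accuracy_report(items, predictions)
    print(micro, macro)  # prints 0.75 0.5
```

In the toy run, three of four questions are right (micro average 0.75), but one subject scores 100% and the other 0%, so the per-subject macro average is 0.5. The gap is exaggerated here, yet it illustrates why the averaging convention matters when comparing published figures.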
Why it matters for evaluation
MMLU became a default yardstick for general-purpose model capability because its breadth resists the narrow over-fitting that single-topic tests invite. Its weaknesses are well documented: some original questions contain errors or ambiguous answers, and as the benchmark aged, training-data contamination became a concern, prompting harder successors such as MMLU-Pro. For anyone reading a model card, an MMLU figure is a coarse but useful signal of broad knowledge, best interpreted alongside reasoning- and code-specific tests.
MMLU is one entry in a wider family of evaluations covered in our glossary, including the reasoning-focused GPQA benchmark and the math-focused GSM8K benchmark. Understanding these helps a sovereign operator judge which open model is fit for a self-hosted stack.
