Definition
MT-Bench is a benchmark introduced by Lianmin Zheng and colleagues in 2023 to evaluate a language model's conversational ability over multiple turns. It consists of 80 carefully written questions spread across eight categories: writing, roleplay, information extraction, reasoning, mathematics, coding, STEM knowledge, and humanities knowledge. Each item has two conversational turns, an opening question and a follow-up, so the test measures not just a single good answer but whether the model stays coherent and handles the follow-up correctly.
LLM-as-judge scoring
Because open-ended chat answers cannot be checked against a single reference string, MT-Bench uses the "LLM-as-judge" approach: a strong model such as GPT-4 reads the responses and either assigns each a numerical quality score on a 1-to-10 scale or picks the preferred answer in a pairwise comparison. The original authors reported that a capable LLM judge agreed with human preferences over 80% of the time, comparable to the agreement between two humans. This made automated, scalable grading of free-form responses practical.
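As an illustration of what an agreement figure like "over 80%" means, the sketch below computes the fraction of pairwise comparisons on which a judge and a human pick the same outcome. The function name `agreement_rate` and the vote lists are hypothetical and made up for this example; this is not the authors' evaluation code, and their exact handling of ties may differ.

```python
def agreement_rate(judge_votes, human_votes):
    """Fraction of comparisons where the judge and the human chose the same outcome.

    Each vote is a label such as "A", "B", or "tie" (hypothetical encoding).
    """
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Made-up votes for five comparisons; the two raters disagree on the third.
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]
rate = agreement_rate(judge, human)
print(f"agreement: {rate:.0%}")
```

The same function applied to two human raters gives the human-to-human baseline against which the judge's agreement is compared.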
Why it is used and its limits
MT-Bench fills a gap between rigid multiple-choice exams and slow, expensive human evaluation, offering a repeatable score for conversational quality. Its caveats are real: an LLM judge can favor longer or more confident-sounding answers (verbosity bias), prefer outputs that resemble its own style (self-enhancement bias), and, in pairwise comparisons, be swayed by the order in which the answers are presented (position bias), so results should be cross-checked against human rankings. The small 80-question set also limits statistical resolution among closely matched models.
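A rough way to see the resolution limit is to estimate the uncertainty of a mean score computed over 80 questions. The sketch below uses the normal approximation for a 95% confidence interval; the per-question standard deviation of 2.0 points is an assumed value chosen for illustration, not a figure reported for MT-Bench.

```python
import math


def mean_ci_half_width(std_dev, n_questions, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean score."""
    return z * std_dev / math.sqrt(n_questions)


# Assumed spread of per-question scores on a 10-point scale (hypothetical).
half_width = mean_ci_half_width(std_dev=2.0, n_questions=80)
print(f"mean score is known to about +/- {half_width:.2f} points")
```

Under that assumption the mean is pinned down only to within roughly 0.4 points, so two models whose averages differ by a tenth or two of a point cannot be reliably separated from this question set alone.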
MT-Bench is often read alongside the human-vote-driven Chatbot Arena Elo ranking, which shares an origin and a focus on real conversational preference rather than fixed-answer capability tests.
