MT-Bench

Definition

MT-Bench is a benchmark introduced by Lianmin Zheng and colleagues in 2023 to evaluate a language model's conversational ability over multiple turns. It consists of 80 hand-written questions spread evenly across eight categories: writing, roleplay, information extraction, reasoning, mathematics, coding, STEM knowledge, and humanities knowledge. Each question has two turns, an opening prompt and a follow-up, so the test measures not just a single good answer but whether the model stays coherent and handles the follow-up correctly.

LLM-as-judge scoring

Because open-ended chat answers cannot be checked against a single reference string, MT-Bench uses the "LLM-as-judge" approach: a strong model such as GPT-4 reads the responses and either assigns each a numerical quality score or picks the preferred answer in a pairwise comparison. The original authors reported that a capable LLM judge agreed with human preferences over 80% of the time, comparable to the agreement between two humans. This made automated, scalable grading of free-form responses practical.
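The agreement figure above is a simple proportion: the share of pairwise comparisons in which the judge and a human picked the same winner. A minimal sketch of that calculation follows. The function name `agreement_rate` and the verdict labels are illustrative assumptions, not code from the MT-Bench authors, and the sample verdicts are made up.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Return the fraction of comparisons where both raters gave the same verdict."""
    if len(judge) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical verdicts for five pairwise comparisons: "A", "B", or "tie".
judge_verdicts = ["A", "B", "A", "tie", "B"]
human_verdicts = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge_verdicts, human_verdicts)
print(f"Judge/human agreement: {rate:.0%}")
```

Here the two raters disagree on one of five comparisons, so the rate is 80%. How ties are counted changes the number, so any reported agreement should state that choice.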

Why it is used and its limits

MT-Bench fills a gap between rigid multiple-choice exams and slow, expensive human evaluation, offering a repeatable score for conversational quality. Its caveats are real: an LLM judge can inherit biases, favor longer or more confident answers, and prefer outputs that resemble its own style, so results should be cross-checked against human rankings. The small 80-question set also limits statistical resolution among closely matched models.
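To see why 80 questions limit resolution, it helps to put an error bar on a model's average score. The sketch below assumes per-question judge scores on a 1-to-10 scale and uses a normal approximation for the 95% interval. The helper `mean_with_ci` and the score values are hypothetical and only illustrate the arithmetic.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci(scores: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation confidence interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate spread")
    mean = statistics.fmean(scores)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


# Hypothetical per-question scores for one model: 80 values between 6 and 9.
scores = [6, 7, 8, 9] * 20

mean, low, high = mean_with_ci(scores)
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With this made-up spread the interval is roughly a quarter of a point on either side of the mean, so two models whose averages differ by a tenth or two cannot be reliably separated on these 80 questions alone.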

MT-Bench is often read alongside the human-vote-driven Chatbot Arena Elo ranking, which shares an origin and a focus on real conversational preference rather than fixed-answer capability tests.
