Benchmarks
Benchmarks are standardized evaluation suites (datasets, tasks, and scoring protocols) used to compare AI models or systems in a consistent way, often across teams or over time. They are a subset of evals where inputs and metrics are intentionally fixed to enable repeatable, apples-to-apples comparisons.
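A minimal sketch of what "fixed inputs and metrics" means in practice: the dataset and scoring protocol never change, so any model (or model version) plugged in is scored on identical terms. The dataset, the exact-match metric, and the toy models below are illustrative assumptions, not any particular benchmark's harness.

```python
from typing import Callable

# Fixed inputs: every evaluated model sees exactly these items.
DATASET = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

def exact_match(prediction: str, reference: str) -> float:
    """Fixed scoring protocol: 1.0 if the normalized strings match, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_benchmark(model_answer: Callable[[str], str]) -> float:
    """Return mean exact-match accuracy over the fixed dataset."""
    scores = [exact_match(model_answer(item["question"]), item["answer"]) for item in DATASET]
    return sum(scores) / len(scores)

# Usage: compare two toy "models" on the identical suite.
print(run_benchmark(lambda q: "4"))      # 0.5
print(run_benchmark(lambda q: "Paris"))  # 0.5
```

Because both the dataset and the metric are frozen, the two scores above are directly comparable, which is the property that makes leaderboard and longitudinal comparisons meaningful.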
Details
Open-ended benchmarks (e.g., MT-Bench, AlpacaEval) increasingly use model-as-a-judge scoring where fixed-answer evaluation is impractical.
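A hedged sketch of the model-as-a-judge pattern, loosely in the style of MT-Bench/AlpacaEval: a judge model is prompted to rate a free-form answer on a numeric scale. The prompt format, the 1-10 scale, and `call_judge_model` are assumptions standing in for whatever judge LLM and API a real harness uses.

```python
import re

JUDGE_PROMPT = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer's helpfulness and correctness from 1 to 10. "
    "Reply with only the number."
)

def call_judge_model(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real LLM API call here.
    return "7"

def judge_score(question: str, answer: str) -> float:
    """Ask the judge model for a 1-10 rating and parse the first number in its reply."""
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else 0.0

print(judge_score("What causes tides?", "Mostly the Moon's gravity."))  # 7.0
```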
Benchmarks are often published with leaderboards, which can create incentives to optimize for benchmark-specific distributions rather than real-world performance (see eval-reality gap). They can also be compromised by data contamination (benchmark examples leaking into training data) or by metric gaming, a form of reward hacking in which LLM developers or reinforcement learning pipelines optimize for benchmark scores rather than genuine capability.
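A naive sketch of a data-contamination check: flag a benchmark item if it shares a long n-gram with the training corpus. Real contamination analyses are more careful (normalization, fuzzy matching, corpus scale); the 13-token window here is an illustrative assumption.

```python
def ngrams(text: str, n: int = 13) -> set:
    """All n-grams (as token tuples) of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_ngrams: set, n: int = 13) -> bool:
    """True if any n-gram of the benchmark item also appears in the training data."""
    return bool(ngrams(benchmark_item, n) & training_ngrams)
```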
Examples
- Common standardized benchmarks named in model cards:
  - MMLU: multiple-choice questions across many academic/professional subjects.
  - GSM8K: grade-school math word problems.
  - HumanEval: Python coding tasks scored by passing unit tests (see the scoring sketch after this list).
  - HellaSwag: commonsense multiple-choice completion/inference.
  - ARC-Challenge (ARC): grade-school science multiple-choice questions (challenge set).
- A public leaderboard benchmark for coding or tool use, used to compare model versions.
- An internal standardized suite run weekly to track performance across product surfaces.
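A hedged sketch of HumanEval-style scoring, referenced from the list above: a completion "passes" if it executes against the task's fixed unit tests without raising. The toy task and function names are illustrative assumptions; real harnesses sandbox execution and report pass@k over multiple samples per task.

```python
def passes_unit_tests(completion_code: str, test_code: str) -> bool:
    """Execute the model's code, then the unit tests, in a shared namespace."""
    namespace: dict = {}
    try:
        exec(completion_code, namespace)  # define the candidate function
        exec(test_code, namespace)        # asserts raise on failure
        return True
    except Exception:
        return False

# Toy task: the "model" must implement add(a, b); the tests are fixed.
completion = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_unit_tests(completion, tests))  # True -> this sample passes
```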
Synonyms
standardized eval suite, leaderboard benchmark