Benchmarks

Benchmarks are standardized evaluation suites (datasets, tasks, and scoring protocols) used to compare AI models or systems in a consistent way, often across teams or over time. They are a subset of evals where inputs and metrics are intentionally fixed to enable repeatable, apples-to-apples comparisons.
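To make "fixed inputs and metrics" concrete, here is a minimal sketch of a benchmark harness: a frozen item set and a single deterministic scoring rule, so any two models are compared on identical inputs. The dataset items and the stand-in models are hypothetical placeholders, not any real benchmark.

```python
from typing import Callable

# Frozen benchmark items: never change between runs or between models.
BENCHMARK_ITEMS = [
    {"prompt": "2 + 2 =", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]

def run_benchmark(model: Callable[[str], str]) -> float:
    """Return exact-match accuracy of `model` on the frozen items."""
    correct = sum(
        model(item["prompt"]).strip() == item["answer"]
        for item in BENCHMARK_ITEMS
    )
    return correct / len(BENCHMARK_ITEMS)

# Both "models" see the same inputs and the same metric, so the scores
# are directly comparable (apples-to-apples).
score_a = run_benchmark(lambda prompt: "4")      # stand-in for model A
score_b = run_benchmark(lambda prompt: "Paris")  # stand-in for model B
```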

Details

Open-ended benchmarks (e.g., MT-Bench, AlpacaEval) increasingly rely on model-as-a-judge scoring, because fixed-answer evaluation is impractical for free-form responses.
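A rough sketch of model-as-a-judge scoring follows. The rubric prompt is illustrative, and `call_judge_model` is a hypothetical stand-in for whatever LLM API the judge runs on; real benchmarks use more detailed rubrics and often pairwise comparisons.

```python
JUDGE_PROMPT = """Rate the assistant's answer from 1 to 10 for helpfulness
and correctness.
Question: {question}
Answer: {answer}
Reply with only the number."""

def call_judge_model(prompt: str) -> str:
    # Placeholder: in practice this calls a strong LLM with the judge prompt.
    return "7"

def judge_score(question: str, answer: str) -> int:
    """Score one open-ended answer using a judge model."""
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())

print(judge_score("Explain overfitting.", "Overfitting is when a model memorizes noise."))
```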

Benchmarks are often published with leaderboards, which can create incentives to optimize for benchmark-specific distributions rather than real-world performance (see eval-reality gap). They can also be compromised by data contamination (benchmark examples leaking into training data) or by metric gaming, a form of reward hacking in which developers or reinforcement learning pipelines optimize for benchmark scores rather than genuine capability.
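One common contamination check is to flag benchmark items whose n-grams also appear in the training corpus. The sketch below assumes a simple whitespace tokenizer and a 13-gram window; both are illustrative choices, not a standard implementation.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All overlapping word n-grams in a text (empty set if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams also occurs in a training document."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)
```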

Examples

  • Common standardized benchmarks named in model cards:
    • MMLU: multiple-choice questions across many academic/professional subjects.
    • GSM8K: grade-school math word problems.
    • HumanEval: Python coding tasks scored by passing unit tests (see the scoring sketch after this list).
    • HellaSwag: commonsense multiple-choice completion/inference.
    • ARC-Challenge (ARC): grade-school science multiple-choice questions (challenge set).
  • A public leaderboard benchmark for coding or tool use, used to compare model versions.
  • An internal standardized suite run weekly to track performance across product surfaces.
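The following sketch shows HumanEval-style scoring: a model completion "passes" an item if the item's unit tests run without errors. Real harnesses sandbox execution and report pass@k over multiple samples; this bare `exec` version is for illustration only.

```python
def passes_unit_tests(completion_code: str, test_code: str) -> bool:
    """True if the benchmark's tests pass against the model's completion."""
    namespace: dict = {}
    try:
        exec(completion_code, namespace)  # define the candidate function
        exec(test_code, namespace)        # run the item's unit tests against it
        return True
    except Exception:
        return False

# Hypothetical example item and completion.
completion = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_unit_tests(completion, tests))  # True if all asserts pass
```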

Synonyms

standardized eval suite, leaderboard benchmark
