Benchmarks

Benchmarks are standardized evaluation suites (datasets, tasks, and scoring protocols) used to compare AI models or systems in a consistent way, often across teams or over time. They are a subset of evals where inputs and metrics are intentionally fixed to enable repeatable, apples-to-apples comparisons.
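To make "fixed inputs and metrics" concrete, here is a minimal sketch of a benchmark harness: a frozen item set and a single deterministic scoring rule, so any two models are compared on identical inputs. The dataset items and the stand-in models are hypothetical placeholders, not any real benchmark.

```python
from typing import Callable

# Frozen benchmark items: never change between runs or between models.
BENCHMARK_ITEMS = [
    {"prompt": "2 + 2 =", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]

def run_benchmark(model: Callable[[str], str]) -> float:
    """Return exact-match accuracy of `model` on the frozen items."""
    correct = sum(
        model(item["prompt"]).strip() == item["answer"]
        for item in BENCHMARK_ITEMS
    )
    return correct / len(BENCHMARK_ITEMS)

# Both "models" see the same inputs and the same metric, so the scores
# are directly comparable (apples-to-apples).
score_a = run_benchmark(lambda prompt: "4")      # stand-in for model A
score_b = run_benchmark(lambda prompt: "Paris")  # stand-in for model B
```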

Details

Open-ended benchmarks (e.g., MT-Bench, AlpacaEval) increasingly rely on model-as-a-judge scoring, because fixed-answer evaluation is impractical for free-form responses.
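A rough sketch of model-as-a-judge scoring follows. The rubric prompt is illustrative, and `call_judge_model` is a hypothetical stand-in for whatever LLM API the judge runs on; real benchmarks use more detailed rubrics and often pairwise comparisons.

```python
JUDGE_PROMPT = """Rate the assistant's answer from 1 to 10 for helpfulness
and correctness.
Question: {question}
Answer: {answer}
Reply with only the number."""

def call_judge_model(prompt: str) -> str:
    # Placeholder: in practice this calls a strong LLM with the judge prompt.
    return "7"

def judge_score(question: str, answer: str) -> int:
    """Score one open-ended answer using a judge model."""
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())

print(judge_score("Explain overfitting.", "Overfitting is when a model memorizes noise."))
```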

Benchmarks are often published with leaderboards, which can create incentives to optimize for benchmark-specific distributions rather than real-world performance (see eval-reality gap). They can also be compromised by data contamination (benchmark examples leaking into training data) or by metric gaming, a form of reward hacking in which developers or reinforcement learning pipelines optimize for benchmark scores rather than genuine capability.
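One common contamination check is to flag benchmark items whose n-grams also appear in the training corpus. The sketch below assumes a simple whitespace tokenizer and a 13-gram window; both are illustrative choices, not a standard implementation.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All overlapping word n-grams in a text (empty set if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams also occurs in a training document."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)
```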

Examples

  • Common standardized benchmarks named in model cards:
    • MMLU: multiple-choice questions across many academic/professional subjects.
    • GSM8K: grade-school math word problems.
    • HumanEval: Python coding tasks scored by passing unit tests (see the scoring sketch after this list).
    • HellaSwag: commonsense multiple-choice completion/inference.
    • ARC-Challenge (ARC): grade-school science multiple-choice questions (challenge set).
  • A public leaderboard benchmark for coding or tool use, used to compare model versions.
  • An internal standardized suite run weekly to track performance across product surfaces.
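The following sketch shows HumanEval-style scoring: a model completion "passes" an item if the item's unit tests run without errors. Real harnesses sandbox execution and report pass@k over multiple samples; this bare `exec` version is for illustration only.

```python
def passes_unit_tests(completion_code: str, test_code: str) -> bool:
    """True if the benchmark's tests pass against the model's completion."""
    namespace: dict = {}
    try:
        exec(completion_code, namespace)  # define the candidate function
        exec(test_code, namespace)        # run the item's unit tests against it
        return True
    except Exception:
        return False

# Hypothetical example item and completion.
completion = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_unit_tests(completion, tests))  # True if all asserts pass
```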

Synonyms

standardized eval suite, leaderboard benchmark
