Evaluation

The Problem

Without a shared benchmark, every paper can claim state of the art (SOTA) on whatever subset of tasks makes its model look best. Practitioners can’t tell what works for their specific use case, and models may overfit to a single benchmark and fail to transfer. The general problem: how to design an evaluation methodology that is comprehensive, comparable across systems, and resistant to overfitting.

The Key Insight

A well-designed benchmark is a coordination mechanism for the field, not just a measurement tool. By picking a fixed set of tasks, datasets, and metrics, and committing to evaluating all models on the complete set under a fixed harness, the benchmark forces models to demonstrate generality rather than cherry-picked strengths. The leaderboard is the product — it concentrates research effort and lets practitioners pick models for their actual needs.

Mechanism in Plain English

A good benchmark has:

Coverage: enough tasks to span the practical use cases of the model class. For embedding models: classification, retrieval, clustering, similarity, etc. For LLMs: reasoning, math, code, knowledge, instruction-following.
Public datasets and labels: anyone can re-evaluate any model. No private test set surprises.
Standard harness: a single tool that runs every model identically. Eliminates “we evaluated differently” excuses.
Held-out test data: training contamination is a real risk; test examples must stay out of every model’s training data, which is harder to guarantee once the datasets are public.
Living leaderboard: a public, dated record of all submissions. Reproducibility through transparency.
Periodic refresh: benchmarks saturate. New, harder tasks must be added before the old ones become trivial.
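
The "standard harness" idea above can be sketched in a few lines. This is a minimal illustration, not any real harness's API: the `Task` and `Model` types, the `run_benchmark` function, and the toy tasks and models are all hypothetical, and accuracy is assumed as the only metric.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a task is a list of (input, expected_label) pairs,
# and a model is any function from an input string to a predicted label.
Task = List[Tuple[str, str]]
Model = Callable[[str], str]


def accuracy(model: Model, task: Task) -> float:
    """Fraction of examples where the model's output equals the label."""
    return mean(1.0 if model(x) == y else 0.0 for x, y in task)


def run_benchmark(
    models: Dict[str, Model], tasks: Dict[str, Task]
) -> Dict[str, Dict[str, float]]:
    """Evaluate every model on every task with the same data and metric."""
    return {
        model_name: {
            task_name: accuracy(model, task) for task_name, task in tasks.items()
        }
        for model_name, model in models.items()
    }


# Toy tasks and trivial baseline "models", made up for illustration only.
TASKS: Dict[str, Task] = {
    "parity": [("1", "odd"), ("2", "even"), ("3", "odd"), ("4", "even")],
    "sign": [("-5", "neg"), ("7", "pos"), ("-1", "neg"), ("2", "pos")],
}
MODELS: Dict[str, Model] = {
    "always_odd": lambda x: "odd",
    "always_pos": lambda x: "pos",
}

results = run_benchmark(MODELS, TASKS)
for name, per_task in results.items():
    print(name, per_task)
```

The point of the design is that no model gets to choose its own tasks or metric: the full model-by-task grid is always filled in by the same code path.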

ASCII Diagram

PRE-BENCHMARK ERA:
  Paper A: "We beat SOTA on Task X by 2 points!"
  Paper B: "We beat SOTA on Task Y by 3 points!"
  Practitioner: "...so which model do I use for my task Z?"
  No way to tell. Field fragments.

POST-BENCHMARK ERA:
  Benchmark covers Tasks 1..N.
  All models evaluated on all N tasks.
  Leaderboard sorts by macro-average and per-task.
  Practitioner: "My task is closest to Task 7. Look up the leaderboard's Task 7 ranking."
  Field converges; comparison is meaningful.
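
The two leaderboard views in the diagram (macro-average and per-task) can give different answers for the same set of models. The sketch below uses invented scores for three hypothetical models; the function names are assumptions for illustration, not part of any leaderboard's code.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical per-task scores for three made-up models.
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"retrieval": 0.80, "clustering": 0.40, "classification": 0.60},
    "model_b": {"retrieval": 0.60, "clustering": 0.70, "classification": 0.65},
    "model_c": {"retrieval": 0.50, "clustering": 0.50, "classification": 0.50},
}


def rank_by_macro_average(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Sort model names by the unweighted mean of their per-task scores."""
    return sorted(
        scores, key=lambda m: mean(list(scores[m].values())), reverse=True
    )


def rank_by_task(scores: Dict[str, Dict[str, float]], task: str) -> List[str]:
    """Sort model names by their score on a single task."""
    return sorted(scores, key=lambda m: scores[m][task], reverse=True)


overall = rank_by_macro_average(SCORES)
retrieval = rank_by_task(SCORES, "retrieval")
print("overall:", overall)
print("retrieval:", retrieval)
```

With these numbers, `model_b` leads the macro-average while `model_a` leads on retrieval, which is why a practitioner with a retrieval-like task should read the per-task column rather than the headline rank.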

Concrete Walkthrough

Examples of high-impact benchmarks:

Benchmark	Domain	Tasks	Impact
ImageNet (2009)	Vision	Classification	Catalyzed the deep learning revolution
GLUE / SuperGLUE (2018-19)	NLU	9 tasks (GLUE), 8 tasks (SuperGLUE)	Drove BERT-era progress
MMLU (2020)	Knowledge LLMs	57 subjects	Standard for LLM “general knowledge”
HumanEval (2021)	Code	164 problems	Standard for code-LLM evaluation
MTEB (2022)	Embeddings	58 datasets, 8 categories	Standard for embedding models
HELM (2022)	LLM holistic	Many	Multi-axis evaluation framework
SWE-bench (2023)	Code agents	Real GitHub issues	Standard for software-engineering agent evaluation

The pattern: a benchmark that covers an important model class with a sufficiently broad task set, run on a public leaderboard, becomes the de facto standard. New papers must report on it. Practitioners use it for model selection. The field’s research effort concentrates around making leaderboard improvements.

What’s Clever

The first clever recognition: fragmented evaluation is a coordination failure. When papers can pick their own tasks, the field’s progress measurements are noisy. A standard benchmark forces apples-to-apples comparison and lets the community see real progress vs cherry-picking.

The second clever recognition: the leaderboard is the artifact, not the paper. The MTEB paper itself is mostly methodology; its impact is the public Hugging Face leaderboard that hundreds of subsequent embedding models are evaluated on. The same is true of GLUE, MMLU, SWE-bench. The benchmark’s value compounds with submissions.

The third recognition: benchmarks must be retired. Once a benchmark is saturated (every top model is within noise of human performance), further “improvements” measure overfitting, not progress. ImageNet classification, CIFAR-10, and SuperGLUE are all effectively saturated, and each has been superseded as a headline measure by harder benchmarks. Field health depends on this churn.
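
One rough way to make "within noise of human performance" concrete is sketched below. It assumes the metric is accuracy on a fixed number of test examples, uses the binomial standard error as the noise estimate, and treats two standard errors as the threshold. All of these are illustrative assumptions, and the function names are hypothetical.

```python
import math
from typing import List


def standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimated from n_examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def is_saturated(
    top_scores: List[float],
    human_baseline: float,
    n_examples: int,
    z: float = 2.0,
) -> bool:
    """True if every top score is no more than z standard errors below the
    human baseline (scores above the baseline also count)."""
    return all(
        score >= human_baseline - z * standard_error(score, n_examples)
        for score in top_scores
    )


# Invented numbers: three top models near a 0.90 human baseline on 1000 examples.
print(is_saturated([0.90, 0.895, 0.91], human_baseline=0.90, n_examples=1000))
# One of the top models is still far below the baseline.
print(is_saturated([0.70, 0.90], human_baseline=0.90, n_examples=1000))
```

A small test set widens the noise band, so the same scores can look saturated on 1,000 examples and clearly separated on 100,000.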

The fourth recognition: benchmarks shape research direction. Whatever the benchmark measures gets optimized. If your benchmark only measures retrieval, you’ll get retrieval-specialized models. If it measures both retrieval and clustering, you’ll get more generalist models. This is Goodhart’s law at work: once the benchmark becomes a target, it ceases to be a good measure. Mitigation: cover broad task categories and refresh periodically.

Key Sources

mteb-massive-text-embedding-benchmark — foundational paper for embedding evaluation; canonical example of a successful benchmark
emergent-abilities-of-large-language-models — relies on benchmark suites to detect phase transitions; demonstrates how evaluation methodology shapes findings

Related Concepts

benchmark — the artifact form of evaluation
sentence-embeddings — the model class MTEB evaluates
evaluation — this page; foundational

Open Questions

How to detect benchmark contamination? If the test set leaks into pretraining data, scores become uninformative. Increasingly important problem with web-scale pretraining.
Holistic vs task-specific evaluation: HELM argues for many axes (accuracy, robustness, fairness, efficiency); narrow benchmarks such as MMLU measure one. What is the right trade-off?
Evaluating generative outputs: when the output is open-ended (chat, summarization), automatic metrics (BLEU, ROUGE) correlate poorly with quality. LLM-as-judge is the modern compromise but introduces its own biases.
Cost of evaluation: running a model on 58 MTEB datasets takes hours. Comprehensive evaluation is expensive. Subsampling strategies are needed.
How to evaluate agentic systems? SWE-bench is one direction (real-world tasks). Interactive-environment benchmarks (WebArena, AgentBench) are another. Still early days.
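
For the contamination question above, one common family of heuristics checks for verbatim n-gram overlap between test examples and the training corpus. The sketch below is a simplified illustration under stated assumptions: whitespace tokenization, lowercasing, and an in-memory corpus. The function names and the default `n` are hypothetical choices, and exact overlap will miss paraphrased or translated leaks.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All n-grams of whitespace tokens in the lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    test_examples: List[str], corpus_docs: Iterable[str], n: int = 8
) -> float:
    """Fraction of test examples sharing at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & corpus_ngrams)
    return flagged / len(test_examples)


# Toy data, with a short n so the overlap is visible.
corpus = ["the quick brown fox jumps over the lazy dog"]
tests = [
    "A quick brown fox appeared",                # shares "quick brown fox"
    "completely unrelated sentence here today",  # no overlap
]
rate = contamination_rate(tests, corpus, n=3)
print(rate)
```

A web-scale corpus would not fit in a Python set, so a practical version would need hashing or an index; the logic of "flag any test example with a shared n-gram" stays the same.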
