Concepts: sentence-embeddings | evaluation | benchmark | bi-encoder
Builds on: sentence-bert-siamese-bert-networks (MTEB became the canonical eval target for SBERT-style and successor models)
Leads to: the modern embedding-model leaderboard culture (Hugging Face MTEB leaderboard); BGE, GTE, E5, NV-Embed all evaluate first on MTEB
Before MTEB, evaluating sentence embeddings was a mess. Each paper picked a small subset of tasks — usually STS-Benchmark and a couple of GLUE classification tasks — and claimed state of the art. But “good for STS” doesn’t mean “good for retrieval,” and “good for English clustering” doesn’t mean “good for cross-lingual mining.” Practitioners couldn’t tell which embedding model would actually work for their task. MTEB (Muennighoff et al., 2022) defined the field’s first comprehensive benchmark: 8 task categories, 58 datasets, 112 languages, all under one evaluation harness. The headline finding was deflationary but important: no embedding model is best at everything.
The core idea
The 8 task categories MTEB covers:
- Classification: a logistic-regression probe on top of frozen embeddings (binary or multi-class). Evaluated by accuracy.
- Clustering: k-means on embeddings, evaluated by V-measure against gold labels.
- Pair classification: given two texts, predict whether they’re related (paraphrase, entailment, duplicate). Evaluated by average precision (AP).
- Reranking: given a query and a candidate list, rerank the candidates by similarity. Evaluated by MAP and MRR.
- Retrieval: given a query, retrieve the relevant documents from a large corpus. Evaluated by nDCG@10.
- STS (Semantic Textual Similarity): predict a similarity score for sentence pairs. Evaluated by Spearman correlation.
- Summarization: score machine-written summaries by their embedding similarity to human-written ones. Evaluated by Spearman correlation with human quality judgments.
- Bitext mining: given two sets of sentences in different languages, find the translation pairs. Evaluated by F1.
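Of the metrics listed, V-measure is probably the least familiar. As a minimal sketch (pure Python, natural-log entropies; the function names are illustrative and this is not MTEB’s own code), it is the harmonic mean of homogeneity and completeness:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), computed from joint and marginal counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(gold, predicted):
    """Harmonic mean of homogeneity and completeness."""
    h_gold, h_pred = entropy(gold), entropy(predicted)
    # Homogeneity: each cluster contains members of a single gold class.
    homogeneity = (
        1.0 if h_gold == 0 else 1.0 - conditional_entropy(gold, predicted) / h_gold
    )
    # Completeness: each gold class lands in a single cluster.
    completeness = (
        1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, gold) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


gold = ["sports", "sports", "politics", "politics"]
relabelled_but_perfect = [1, 1, 0, 0]   # cluster ids are arbitrary
uninformative = [0, 1, 0, 1]            # every cluster mixes both classes
print(v_measure(gold, relabelled_but_perfect), v_measure(gold, uninformative))
```

Because the score depends only on how items are grouped, a perfect clustering with permuted cluster ids still scores at the top of the range, which is why it suits k-means output.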
Coverage: 58 datasets across all 8 tasks, 112 languages (heavy on English, also Chinese, Russian, Korean, German, Spanish, French).
Methodology: standardize the harness so every model is evaluated identically across all 58 datasets. By default each text is encoded once, and the frozen embeddings feed a task-specific scorer: cosine similarity for retrieval, reranking, and STS; k-means (Euclidean distance) for clustering; a linear classifier for classification.
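To make the "encode once, then score" pattern concrete, here is a minimal sketch of the STS case in pure Python. The `encode` lookup and its vectors are hypothetical stand-ins for a real model, and the helper names are illustrative rather than anything from the MTEB codebase:

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def sts_score(pairs, gold, encode):
    """Cosine similarity per pair, then rank-correlate with human scores."""
    predicted = [cosine(encode(a), encode(b)) for a, b in pairs]
    return spearman(predicted, gold)


# Hypothetical 2-d "embeddings" standing in for a real encoder.
TOY_VECTORS = {
    "a cat sits": [1.0, 0.0],
    "a cat is sitting": [0.9, 0.1],
    "a dog sits": [0.7, 0.3],
    "stocks fell": [0.0, 1.0],
}
pairs = [
    ("a cat sits", "a cat is sitting"),
    ("a cat sits", "a dog sits"),
    ("a cat sits", "stocks fell"),
]
gold = [4.8, 2.5, 0.2]  # made-up human similarity ratings
print(sts_score(pairs, gold, TOY_VECTORS.__getitem__))
```

Only the ordering of the predicted similarities matters to Spearman, so a model is not penalized for producing cosines on a different scale from the human ratings.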
Walkthrough
The headline finding (Table 4 in the paper, average MTEB score across all tasks; a selection of the 33 models benchmarked):
| Model | Avg MTEB | Best at… |
|---|---|---|
| ST5-11B (Sentence-T5, contrastively trained T5 encoder) | 64.0 | STS |
| GTR-XXL | 60.6 | Retrieval |
| Sentence-T5-base | 59.5 | Classification |
| MPNet (all-mpnet-base) | 57.8 | Generalist |
| GTR-base | 56.6 | Retrieval |
| MiniLM-L12 | 56.5 | Speed-vs-accuracy |
| OpenAI text-embedding-ada-002 | 60.5 | Generalist (closed-source) |
| Cohere multilingual-22-12 | 64.5 (English) | Multilingual |
The deflationary finding: the highest-scoring model on STS (ST5-11B at 80.4 Spearman) is not the highest on retrieval (GTR-XXL at 49.3 nDCG@10). The highest on classification is yet another model. No single embedding wins all 8 task categories.
This undercuts “we beat ada-002” claims that compare on a single task category. It also explains why production teams report different “best embedding” choices: they were optimizing for different tasks all along.
A worked example of how MTEB scoring works for retrieval (e.g., MS MARCO):
SETUP:
- Corpus: 8.8M passages.
- Queries: ~7K evaluation queries (the dev split) with relevance labels.
PER MODEL:
1. Encode every passage once (offline, ~1 hour for 8.8M).
2. Encode every query (~1 second per query).
3. For each query: compute cosine vs every passage; rank.
4. Score: nDCG@10 (how good are the top 10 results?).
REPORTED METRIC: average nDCG@10 across all 7K queries.
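The steps above can be sketched on a toy corpus. This is a minimal pure-Python illustration, not the MTEB harness: the vectors, ids, and function names are made up, the search is brute force, and nDCG uses the linear-gain form with a log2 discount (evaluation toolkits differ in such details).

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def dcg(relevances):
    """Discounted cumulative gain; position 0 is the top result."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """qrels maps doc id -> graded relevance; unlisted docs count as 0."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def evaluate_retrieval(corpus_vecs, query_vecs, qrels, k=10):
    """Rank every passage per query by cosine, then average nDCG@k."""
    scores = []
    for qid, qvec in query_vecs.items():
        ranked = sorted(
            corpus_vecs, key=lambda d: cosine(qvec, corpus_vecs[d]), reverse=True
        )
        scores.append(ndcg_at_k(ranked, qrels[qid], k))
    return sum(scores) / len(scores)


# Hypothetical precomputed embeddings (steps 1 and 2).
corpus_vecs = {"d1": [1.0, 0.0], "d2": [0.8, 0.6], "d3": [0.0, 1.0]}
query_vecs = {"q1": [1.0, 0.1], "q2": [0.1, 1.0]}
qrels = {"q1": {"d1": 1}, "q2": {"d1": 1}}

# q1 ranks its relevant passage first; q2 ranks it third.
print(evaluate_retrieval(corpus_vecs, query_vecs, qrels))
```

At MS MARCO scale the brute-force `sorted` call would normally be replaced by batched matrix multiplication or an approximate nearest-neighbour index; the metric itself is unchanged.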
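The same encoder is then run on the other 57 datasets with task-specific evaluation. The model’s headline MTEB score is the mean of the main metric over the individual datasets (56 English datasets on the original leaderboard), not a mean of the 8 category averages. Categories with many datasets, such as retrieval (15) and classification (12), therefore weigh more than summarization (1).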
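How per-dataset results are rolled up into one number matters, because the categories contain very different numbers of datasets. A minimal sketch with hypothetical scores (not taken from the paper) comparing a mean over datasets with a mean over category means:

```python
from statistics import mean

# Hypothetical main-metric scores per dataset, grouped by task category.
scores = {
    "retrieval": [50.0, 40.0, 45.0],
    "sts": [80.0, 82.0],
    "summarization": [30.0],
}


def dataset_mean(scores_by_category):
    """Every dataset counts once, so large categories dominate."""
    return mean(s for values in scores_by_category.values() for s in values)


def category_macro_mean(scores_by_category):
    """Every category counts once, regardless of how many datasets it has."""
    return mean(mean(values) for values in scores_by_category.values())


print(dataset_mean(scores), category_macro_mean(scores))
```

With these made-up numbers the single summarization dataset pulls the category-level mean down further than it pulls the dataset-level mean, which is the kind of gap to keep in mind when comparing averages reported by different sources.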
What’s clever — find the instinct
The clever move is recognizing that fragmented evaluation was holding the embedding field back. Every paper claimed SOTA on a chosen-favorable subset, so the field had no shared baseline. By picking 58 datasets and committing to evaluating all of them under a fixed harness, MTEB forces models to demonstrate generality, not cherry-picked strengths.
“We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.”
The second clever recognition: the leaderboard is the product. The paper’s biggest impact wasn’t the specific model rankings; it was establishing a public Hugging Face leaderboard where any new model could be added. By 2024 the MTEB leaderboard had become the gravitational center of embedding research — the BGE family, GTE, E5, NV-Embed, Mistral-embed, OpenAI’s text-embedding-3 series, all benchmarked first on MTEB.
“MTEB comes with open-source code and a public leaderboard.”
The third clever recognition: including bitext mining and multilingual tasks forced models to be honest about cross-lingual performance. Pre-MTEB, multilingual models claimed strong performance based on a few high-resource languages. MTEB’s bitext mining task tests on 112 languages, including low-resource ones, exposing where models really fail.
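For intuition, bitext mining can be sketched as nearest-neighbour matching between two sentence sets, scored with F1 against the gold translation pairs. The vectors below are hypothetical, and the `threshold` parameter is an illustrative addition (it lets the miner abstain) rather than part of MTEB’s procedure:

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mine_bitext(src_vecs, tgt_vecs, threshold=0.0):
    """Pair each source sentence with its most similar target sentence."""
    pairs = set()
    for i, s in enumerate(src_vecs):
        sims = [cosine(s, t) for t in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:  # abstain on weak matches
            pairs.add((i, j))
    return pairs


def f1_score(predicted, gold):
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical embeddings: source sentence i translates to target sentence i.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.6, 0.0, 0.5]]  # third pair is poorly aligned
gold = {(0, 0), (1, 1), (2, 2)}

print(f1_score(mine_bitext(src, tgt), gold))
print(f1_score(mine_bitext(src, tgt, threshold=0.7), gold))
```

A model whose embedding space does not align a given language pair shows up here as low-similarity or wrong nearest neighbours, which is how weak low-resource performance becomes visible in the score.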
Does it work? What breaks?
Practical insights from the benchmark (Table 5, paraphrased; the BGE entries come from the later leaderboard, since the BGE models postdate the paper):
| If your task is… | Best general advice |
|---|---|
| STS / paraphrase | Sentence-T5 or Cohere multilingual |
| Retrieval (English) | GTR or BGE |
| Classification | MPNet or generalist Sentence-T5 |
| Clustering | MPNet (good across all clustering tasks) |
| Multilingual retrieval | Cohere multilingual or BGE-m3 |
| Speed-constrained | MiniLM-L6 or L12 |
The general lessons MTEB surfaced:
- Bigger is not always better. ST5-11B (the largest Sentence-T5, built on the encoder of the 11B-parameter T5) beats much smaller models on STS but loses to GTR on retrieval. Scale within a task family helps; cross-family transfer is uneven.
- Task-specific fine-tuning helps a lot. GTR is fine-tuned for retrieval and dominates retrieval; it doesn’t dominate STS. ST5 is the opposite.
- Closed source isn’t dominant. OpenAI ada-002 ranks well but doesn’t dominate. Open models like BGE eventually beat it on most categories.
- Multilingual isn’t free. Cross-lingual transfer requires explicitly multilingual pretraining; English-only models score poorly on bitext mining.
What breaks (limitations of MTEB itself):
- Domain coverage. MTEB’s tasks are drawn from public datasets, mostly Wikipedia / news / general web. Domain-specific embeddings (legal, medical, code) need separate evaluation.
- Long-document tasks. Most MTEB datasets have short text (sentences, tweets, abstracts). Long-document retrieval (e.g., legal contracts) is underrepresented.
- Language coverage. MTEB v1 (the paper’s version) was English-heavy. Later efforts extended it: MMTEB is broadly multilingual, and C-MTEB covers Chinese.
- Reranker-style models. MTEB scores bi-encoders. Cross-encoders and ColBERT-style late-interaction models score differently and are evaluated separately.
- Saturation on STS. Several tasks (especially STS subsets) have a ceiling around 85-90; further improvements are marginal and may reflect overfitting to the benchmark.
So what?
MTEB is the standard. Nearly every embedding model paper now reports MTEB scores, usually in its headline results table. The Hugging Face MTEB leaderboard is a routine reference for retrieval engineers. The benchmark’s release coincided with, and arguably helped drive, the explosion of open-source embedding models in 2023-2024 (BGE, GTE, E5).
For Saikat’s work:
- POI dedup with text features: choose embeddings based on MTEB clustering and pair-classification scores, not retrieval. Clustering scores tell you how well an embedding space separates semantically distinct entities.
- Indonesian address normalization: this is bitext-mining-shaped (parallel mining across address surface forms). Filter MTEB models by their bitext mining and multilingual classification scores. BGE-m3 is a reasonable default for multilingual settings at the time of writing.
- Career gap (large-scale model serving): embedding models are simpler to serve than LLMs but have their own production challenges — cold-start indexing, vector store sharding, periodic re-indexing on retraining. MTEB doesn’t measure those, but it’s the input to the model-selection step.
The deeper principle MTEB establishes: a well-designed benchmark is a coordination mechanism. The field had the components for embedding evaluation before MTEB — STS-Benchmark, BEIR, BUCC bitext mining — but they were scattered. MTEB’s contribution was aggregation. Once a single leaderboard existed, all serious models converged to evaluating on it. This is the same pattern as ImageNet for vision, GLUE/SuperGLUE for NLU, MMLU for LLMs, and SWE-bench for code agents.
“Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date.”
For the practitioner: always check MTEB before picking an embedding model. Filter the leaderboard by your task category (clustering vs retrieval vs classification), language, and inference cost. Don’t pick the top of the overall-average leaderboard unless your task is truly multipurpose.
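That filtering step can be written down as a small selection rule. The sketch below assumes you have copied leaderboard rows into your own records; the `Candidate` fields, model names, and scores are all hypothetical, and parameter count is used as a rough proxy for inference cost:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    category_scores: dict   # task category -> average score in that category
    languages: set          # language codes the model supports
    params_millions: float  # rough proxy for inference cost


def pick_model(candidates, category, language, max_params_millions):
    """Best model for one task category under language and size constraints."""
    eligible = [
        c
        for c in candidates
        if language in c.languages
        and c.params_millions <= max_params_millions
        and category in c.category_scores
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.category_scores[category])


# Made-up leaderboard rows for illustration only.
candidates = [
    Candidate("model-a", {"retrieval": 55.0, "clustering": 40.0}, {"en"}, 335),
    Candidate("model-b", {"retrieval": 50.0, "clustering": 46.0}, {"en", "id"}, 110),
    Candidate("model-c", {"retrieval": 58.0, "clustering": 44.0}, {"en", "id"}, 7000),
]

best = pick_model(candidates, category="clustering", language="id", max_params_millions=500)
print(best.name if best else "no eligible model")
```

The same candidates give a different winner once the category or the size budget changes, which is the practical consequence of no model dominating every task.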
Connections
- sentence-bert-siamese-bert-networks — the foundational embedding model evaluated on MTEB; established the bi-encoder pattern MTEB tests
- colbert-late-interaction-retrieval — late-interaction models are evaluated separately from bi-encoders on retrieval
- sentence-embeddings — MTEB is the canonical evaluation for the entire concept
- evaluation — establishes the methodology principle “evaluate on a wide task suite”
- benchmark — the most-cited modern embedding benchmark
- bi-encoder — MTEB measures bi-encoders’ generalization across tasks
Citation
Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. EACL 2023. arXiv preprint (2022): https://arxiv.org/abs/2210.07316