C-Pack / BGE: Packed Resources for General Chinese Embeddings

Concepts: sentence-embeddings | contrastive-learning | bi-encoder | self-supervised-learning
Builds on: sentence-bert-siamese-bert-networks (BGE is the modern descendant of SBERT, with the same bi-encoder architecture but vastly more training data and more sophisticated training stages)
Builds on: mteb-massive-text-embedding-benchmark (MTEB is the benchmark that made BGE’s claims comparable; BGE’s reign was specifically on MTEB)
Leads to: the modern open-source embedding stack (BGE-m3 for multilingual, BGE-reranker for cross-encoder reranking) and competitors like GTE, E5, and NV-Embed

Sentence-BERT (2019) showed that dense embeddings could replace cross-encoders for retrieval. By 2023 the recipe had matured, but most strong models were closed source (OpenAI’s text-embedding-ada-002, Cohere’s embed-multilingual). C-Pack (Xiao et al., SIGIR 2024), the paper that introduced the BAAI BGE model family, broke that dominance. The contribution is mostly engineering: at the time, no one had assembled the full open-source pipeline at scale (on the order of a hundred million contrastive training pairs, a multi-stage curriculum, and a large enough base model). BGE did, and at release it topped MTEB for English and C-MTEB for Chinese embedding tasks.

The core idea

Three training stages, in order:

General pretraining (RetroMAE). A masked-autoencoder objective: corrupt a passage by aggressively masking tokens, then ask the model to reconstruct them. This teaches the encoder to compress meaning into the hidden state. It runs on raw web text, with no labeled pairs.
General-purpose contrastive pretraining. Next, train the encoder for similarity on a massive corpus of weakly paired data: web titles vs body text, question-answer pairs from Quora and StackExchange, paraphrases from back-translation, and retrieval-style queries from CCMatrix. Roughly 100M+ pairs in total. The model learns to place related texts near each other in vector space.
Task-specific fine-tuning. Fine-tune on labeled retrieval datasets (MS MARCO, NQ, T2Ranking) and on the target embedding tasks. This is where the model picks up the conventions of “what counts as similar for this task.”
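The corruption step of the first stage can be illustrated with a minimal sketch. The `mask_tokens` helper, the `[MASK]` string, and the 0.5 mask ratio are assumptions chosen for illustration, not the paper's settings. RetroMAE itself pairs an encoder with a lightweight decoder; only the masking of the input is shown here.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption)

def mask_tokens(tokens, mask_ratio, seed=0):
    """Replace a fixed fraction of tokens with MASK.

    Returns the corrupted sequence and a dict of position -> original token,
    which is what a reconstruction objective would be trained to predict.
    """
    rng = random.Random(seed)
    n_mask = round(len(tokens) * mask_ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in masked_positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return corrupted, targets

tokens = "dense embeddings map related passages to nearby vectors".split()
corrupted, targets = mask_tokens(tokens, mask_ratio=0.5)
print(corrupted)
print(targets)
```

The higher the mask ratio, the less the model can rely on local context, so more of the passage's meaning has to be carried by the encoder's summary representation.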

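At inference: the same as SBERT. Encode each text to a single fixed-size vector (384, 768, or 1024 dimensions depending on model size) and compare vectors with cosine similarity; optionally L2-normalize the vectors first so that cosine similarity reduces to a dot product.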
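The comparison step can be sketched in plain Python. The three-dimensional vectors below are toy stand-ins for real embeddings, and the helper names are assumptions for illustration; no actual BGE model is loaded.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity is the dot product of the L2-normalized vectors."""
    return dot(l2_normalize(a), l2_normalize(b))

# Toy stand-ins for a query embedding and two passage embeddings.
query = [0.2, 0.1, 0.7]
passages = {"p1": [0.2, 0.1, 0.6], "p2": [-0.5, 0.9, 0.0]}

ranked = sorted(passages, key=lambda pid: cosine_similarity(query, passages[pid]), reverse=True)
print(ranked)
```

Normalizing once at indexing time is a common design choice because it lets the search engine use a plain inner product while still ranking by cosine similarity.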

Walkthrough

The contrastive loss (Stages 2 and 3 use the same form):

For each batch of $N$ query-positive pairs $(q_{i}, p_{i}^{+})$, also collect “in-batch negatives”: every other passage in the batch is a negative for query $q_{i}$. Compute:

$L = -\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(q_{i}, p_{i}^{+}) / \tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(q_{i}, p_{j}) / \tau)}$

Where:

$\mathrm{sim}$ is cosine similarity.
$\tau$ is a learned or fixed temperature (BGE uses $\tau = 0.02$).
The numerator scores the true positive; the denominator includes all in-batch negatives.
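The loss above can be sketched in plain Python for a tiny batch. This is an illustrative implementation under stated assumptions: `info_nce` and `cosine` are hypothetical helper names, passage `i` is taken to be the positive for query `i`, and the loss is summed over the batch as in the formula (training code often averages instead).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(queries, passages, tau=0.02):
    """InfoNCE with in-batch negatives; passages[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / tau for p in passages]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # equals -log(softmax probability of the positive)
    return total

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]   # each query sits next to its own passage
shuffled = [[0.1, 1.0], [1.0, 0.1]]  # positives swapped, so each query prefers the wrong passage

print(info_nce(queries, aligned))
print(info_nce(queries, shuffled))
```

With a temperature as small as 0.02, similarity differences are amplified fifty-fold before the softmax, which is why the mismatched batch is penalized so heavily.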

This is the standard InfoNCE loss. The key practical detail: BGE uses very large batch sizes (the paper reports batches as large as 19,200) to get many negatives per query, plus “hard negatives” mined from a previous-iteration retriever. Hard negatives are passages that an earlier version of the model scored as relevant but that are not; they are the most informative negatives because they teach the model the boundary cases.
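Hard-negative mining can be sketched as "score the corpus with the current model, then keep the top-ranked passages that are not labeled positive." The function name, the toy two-dimensional vectors, and the use of a plain dot product as the score are all assumptions for illustration; this is not the paper's mining code.

```python
def mine_hard_negatives(query_vec, corpus, positive_ids, k=2):
    """Return the k highest-scoring passage ids that are not labeled positive.

    corpus maps passage id -> embedding from an earlier checkpoint (toy vectors here).
    """
    scored = [
        (sum(q * p for q, p in zip(query_vec, vec)), pid)
        for pid, vec in corpus.items()
        if pid not in positive_ids
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:k]]

corpus = {
    "gold": [0.9, 0.1],
    "near_miss": [0.8, 0.3],
    "related": [0.6, 0.5],
    "off_topic": [-0.7, 0.2],
}
hard = mine_hard_negatives([1.0, 0.0], corpus, positive_ids={"gold"}, k=2)
print(hard)
```

The off-topic passage is never selected: it is already easy to reject, so it contributes little gradient compared with the near misses.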

Stage 2 data scale:

General contrastive pairs:
  - Web title-body: ~50M pairs
  - Question-answer (Quora, StackExchange): ~10M pairs
  - Translation-paraphrase backtranslation: ~20M pairs
  - Retrieval queries from CCMatrix: ~30M pairs
  Total: ~100M weakly-supervised pairs.

Stage 3 fine-tuning data:
  - MS MARCO: 500K pairs
  - NQ: 100K pairs
  - T2Ranking (Chinese): 200K pairs
  - Custom curated: ~1M pairs
  Total: ~2M high-quality labeled pairs.

Model sizes released:

Model	Params	Hidden	Languages	Best at
BGE-small-en	33M	384	EN	Speed
BGE-base-en	109M	768	EN	Default
BGE-large-en	335M	1024	EN	Accuracy
BGE-m3	568M	1024	100+	Multilingual
BGE-reranker	278M	768	EN/ZH	Cross-encoder reranker
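The Hidden column translates directly into index storage cost. The following back-of-the-envelope sketch assumes uncompressed float32 vectors in a flat index and a hypothetical corpus of 10 million passages; real vector stores add overhead or apply quantization, so treat the figures as rough lower bounds.

```python
def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a flat index of float32 vectors, ignoring any ANN overhead."""
    return num_vectors * dim * bytes_per_value / 1e9

# Hidden sizes taken from the table above.
HIDDEN = {"BGE-small-en": 384, "BGE-base-en": 768, "BGE-large-en": 1024}

# Hypothetical corpus size chosen for illustration.
sizes = {name: index_size_gb(10_000_000, dim) for name, dim in HIDDEN.items()}
for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
```

Under these assumptions the small model needs about 15 GB, the base model about 31 GB, and the large model about 41 GB, which is one reason the smaller variants remain attractive when accuracy requirements allow.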

What’s clever — find the instinct

The first clever recognition: the bottleneck for open-source embeddings was data, not model architecture. SBERT’s architecture from 2019 was already enough — what was missing was a 100M-pair training corpus to fine-tune on. Closed-source models were winning by hoarding labeled pairs. BGE’s contribution was assembling the open-source equivalent.

“C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models.”

The second clever move: multi-stage curriculum. RetroMAE pretraining first (no contrastive), then weak-supervised contrastive (massive but noisy), then strong-supervised fine-tuning (small but clean). Each stage prepares the model for the next. Skipping stages — or doing them in the wrong order — produces noticeably worse models.

“The training procedure of C-TEM consists of three stages: general pre-training, general purpose fine-tuning, and task-specific fine-tuning.”

The third clever move: large batch sizes plus hard negative mining. The InfoNCE loss benefits from more negatives per query: comparing a query against tens of thousands of passages gives a sharper training signal than comparing against 256. The paper reports batch sizes as large as 19,200, made feasible by gradient checkpointing and cross-device embedding sharing (gradient accumulation alone would not help, because it does not add in-batch negatives). Hard negatives mined from a previous checkpoint are then added, which sharpens the model’s discrimination on edge cases.

The fourth recognition: multilinguality is a separate axis. BGE-m3 isn’t just BGE trained on multilingual data — it’s a deliberately multilingual model with a special embedding mode that produces dense, sparse (BM25-like), and multi-vector (ColBERT-like) representations from a single pass. This is the modern hybrid retrieval pattern.
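How the three representation types might be combined at scoring time can be sketched as follows. The function names, the toy inputs, and the mixing weights are assumptions for illustration, not BGE-m3's actual API or recommended settings. The multi-vector score uses a ColBERT-style MaxSim, averaged over query token vectors in this sketch.

```python
def dense_score(q_vec, d_vec):
    """Dot product of single-vector embeddings."""
    return sum(a * b for a, b in zip(q_vec, d_vec))

def sparse_score(q_weights, d_weights):
    """Sum of weight products over terms shared by query and document."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def multi_vector_score(q_vecs, d_vecs):
    """MaxSim: each query token vector takes its best-matching document vector."""
    return sum(max(dense_score(q, d) for d in d_vecs) for q in q_vecs) / len(q_vecs)

def hybrid_score(dense, sparse, multi, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the three signals; the weights are hypothetical."""
    w_dense, w_sparse, w_multi = weights
    return w_dense * dense + w_sparse * sparse + w_multi * multi

d = dense_score([1.0, 0.0], [0.5, 0.5])
s = sparse_score({"bge": 0.5, "model": 0.2}, {"bge": 0.4, "paper": 0.9})
m = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])
print(hybrid_score(d, s, m))
```

In practice the weights would be tuned on a validation set, since the three scores live on different scales and their relative usefulness depends on the query mix.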

Does it work? What breaks?

MTEB English leaderboard at BGE release (Sept 2023):

Rank	Model	Avg MTEB
1	BGE-large-en	64.2
2	E5-large-v2	62.3
3	OpenAI text-embedding-ada-002	60.9
4	sentence-t5-xxl	59.5
5	gtr-t5-xxl	58.0

MTEB Chinese (C-MTEB):

Rank	Model	Avg C-MTEB
1	BGE-large-zh	64.5
2	OpenAI ada-002 (multilingual)	53.0
3	text2vec-base-chinese	47.4

A 10+ point gap on Chinese embeddings — driven by the Chinese-specific contrastive corpus.

Retrieval quality on BEIR (zero-shot domain transfer):

Model	Avg nDCG@10
BM25	41.7
OpenAI ada-002	49.2
BGE-large-en	52.6

BGE generalizes to out-of-distribution domains (legal, medical, financial) better than ada-002 on this benchmark. One plausible explanation is a more diverse contrastive corpus, though ada-002’s training data is not public, so any comparison of training mixes is speculative.
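For readers unfamiliar with the metric in the table, nDCG@10 can be sketched as below; the table reports it multiplied by 100. This version uses linear gain and assumes the input list holds the graded relevance of the retrieved documents in ranked order, with every relevant document present in that list. Benchmark tooling computes the ideal ranking from the full set of judgments instead.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance discounted by log2 of the rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0]))  # relevant document ranked first
print(ndcg_at_k([0, 1]))     # relevant document ranked second
```

Because of the logarithmic discount, moving the only relevant document from rank 1 to rank 2 already costs more than a third of the score, which is why small nDCG differences between models can reflect noticeable ranking changes.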

What breaks:

Long-context. BGE encodes up to 512 tokens. Long documents need chunking. (BGE-m3 extends to 8192 tokens.)
Domain-specific. On highly specialized domains (legal Korean, medical German), out-of-the-box BGE underperforms domain-fine-tuned alternatives.
Saturation on MTEB. Top-of-leaderboard models (NV-Embed, E5-Mistral, GTE-large-en-v2) are within 1-2 points of each other in 2024. Further improvements may be benchmark overfitting.
Query/passage symmetry. BGE encodes both queries and passages with the same model; some tasks benefit from asymmetric encoders (different prompts for queries vs passages — used by E5 and modern BGE prompts).
Inference cost. BGE-large is 335M params; production at scale needs distillation to a smaller model or specialized hardware.
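The long-context limitation is usually handled by chunking. A minimal sketch of sliding-window chunking over an already tokenized document follows; the `chunk_tokens` helper and the 64-token overlap are assumptions for illustration. A real tokenizer also adds special tokens, so the usable budget per chunk is slightly below 512.

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that sentences cut at a
    boundary still appear intact in at least one chunk.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    stride = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end of the document
    return chunks

doc = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])
```

Each chunk is then embedded separately, and retrieval returns chunks rather than whole documents; the overlap trades a somewhat larger index for fewer answers lost at chunk boundaries.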

So what?

BGE is a widely used open-source default for dense embeddings as of 2024-2025. Hugging Face Inference Endpoints, the Vespa engine, and many RAG implementations offer BGE as a default or recommended model. BGE-m3 is a strong choice for multilingual retrieval, and BGE-reranker is a common default for the re-ranking stage.

For Saikat’s work:

Address normalization SaaS: BGE-m3 is the right starting point for an Indonesian-focused embedding service. Out-of-the-box it handles Bahasa Indonesia and English; fine-tune on a small set of Indonesian address pairs to lock in domain conventions.
POI dedup: BGE-base-en for English POI names; for Indonesian POIs, fine-tune BGE-m3 on a curated dataset of “different surface forms of the same POI.” Combine the BGE similarity feature with the existing CatBoost geometric features.
RAG over internal docs: BGE-large-en for the index; BGE-reranker for re-ranking the top-100. The CodeAct-style agent then conditions on the reranked top-5.
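The retrieve-then-rerank pattern in the last item can be sketched with the two models abstracted as scoring functions. Everything here is a stand-in: `overlap_score` plays the role of the bi-encoder and `phrase_score` the role of the cross-encoder reranker, so the example runs without loading any model.

```python
def retrieve_then_rerank(query, corpus, embed_score, rerank_score, top_k=100, final_k=5):
    """Two-stage retrieval: a cheap shortlist, then a costlier re-scoring of it."""
    shortlist = sorted(corpus, key=lambda doc: embed_score(query, doc), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda doc: rerank_score(query, doc), reverse=True)[:final_k]

def overlap_score(query, doc):
    """Stand-in for bi-encoder similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_score(query, doc):
    """Stand-in for a cross-encoder: rewards an exact phrase match."""
    return overlap_score(query, doc) + (10 if query.lower() in doc.lower() else 0)

corpus = [
    "password rules and how to reset a router",
    "reset password from the account page",
    "cafeteria menu for friday",
    "password policy for contractors",
]
top = retrieve_then_rerank("reset password", corpus, overlap_score, phrase_score, top_k=3, final_k=2)
print(top)
```

The first stage cannot separate the two documents that share both query words, while the second stage promotes the one that actually answers the query. That division of labor is the reason to pay the reranker's cost only on the shortlist.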

The deeper principle BGE establishes: scale + curated data + multi-stage curriculum is the modern open-source recipe. The architecture (bi-encoder, contrastive InfoNCE) hasn’t changed since SBERT. What changed is the data pipeline. Anyone can replicate this — but it costs ~$50-100K in compute and a year of curation work. BGE’s value proposition is that BAAI did the work and released the weights.

“Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release.”

For the practitioner: start with BGE-base-en for English and BGE-m3 for multilingual. Only train your own embedding model if you have a strong domain-specific signal that BGE doesn’t capture. Even then, fine-tune from BGE rather than training from scratch.

Connections

sentence-bert-siamese-bert-networks — BGE inherits the bi-encoder architecture from SBERT
mteb-massive-text-embedding-benchmark — the benchmark BGE was specifically designed to win
colbert-late-interaction-retrieval — BGE-m3’s multi-vector mode produces ColBERT-style late-interaction embeddings
rag-retrieval-augmented-generation — BGE is the default RAG retrieval backbone
sentence-embeddings — BGE is the modern default sentence embedder
contrastive-learning — the loss is InfoNCE; the mining strategy is hard-negative
bi-encoder — BGE is a bi-encoder; pair with BGE-reranker (cross-encoder) for hybrid retrieval
self-supervised-learning — RetroMAE pretraining and most contrastive data are self-supervised

Citation

arXiv:2309.07597

Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., & Nie, J.-Y. (2023). C-Pack: Packed Resources For General Chinese Embeddings. SIGIR 2024. https://arxiv.org/abs/2309.07597
