Concepts: sentence-embeddings | siamese-networks | contrastive-learning | bi-encoder | semantic-similarity
Builds on: bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding — uses pretrained BERT as the encoder
Leads to: rag-retrieval-augmented-generation — SBERT-style bi-encoders are the default retrieval backbone in RAG systems
BERT is an excellent scorer for sentence pairs: feed it both sentences separated by a [SEP] token, run a single forward pass, and read out the relevance score. That design is fatal at scale, because every comparison costs a full forward pass. Ranking a 10,000-sentence corpus against one query takes 10,000 passes, and finding the most similar pair among 10,000 sentences takes n(n-1)/2 ≈ 50 million passes, about 65 hours on a V100. Sentence-BERT (Reimers & Gurevych, EMNLP 2019) reorganizes the architecture: encode each sentence independently into a fixed-length vector, then compare vectors with cosine similarity. The same 10K-sentence pair search drops to ~5 seconds, with only a modest accuracy loss.
The core idea
The analogy: A cross-encoder (BERT-style) is like reading two essays side by side to compare them — slow but careful, you notice every interaction. A bi-encoder (Sentence-BERT-style) is like writing a one-paragraph summary of each essay separately, then comparing summaries — much faster, slightly lossy, but the summaries are reusable.
The architecture:
    SENTENCE A ----[BERT]----> pool ----> u (768-dim vector)
                                           \
                                            cosine
                                           /
    SENTENCE B ----[BERT]----> pool ----> v (768-dim vector)
The two BERT towers share weights (siamese). The pooling step takes the per-token output and reduces it to a fixed vector — the paper finds mean pooling over all token outputs works best, beating the [CLS] token and max pooling. Then cosine similarity (or any vector distance) on (u, v) gives the relevance score.
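A minimal sketch of the pooling and comparison steps, using toy 3-dim token vectors in place of real 768-dim BERT outputs. The function names `mean_pool` and `cosine` and the attention-mask convention (1 = real token, 0 = padding) are assumptions for illustration, not the paper's code.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    return [x / max(count, 1) for x in total]

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy per-token outputs; the last row of sentence A is padding.
tokens_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [9.0, 9.0, 9.0]]
mask_a = [1, 1, 0]
tokens_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
mask_b = [1, 1]

u = mean_pool(tokens_a, mask_a)  # [2.0, 0.0, 1.0]
v = mean_pool(tokens_b, mask_b)  # [2.0, 0.0, 1.0]
score = cosine(u, v)             # very close to 1.0
```

Masking matters in batched inference: without it, padding rows would drag the mean toward whatever the encoder emits for pad tokens.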
Training: fine-tune the entire siamese stack on Natural Language Inference (NLI) pairs. Three labels: entailment (the first sentence implies the second), neutral, contradiction. The classification objective feeds the concatenation [u, v, |u-v|] into a softmax classifier trained with cross-entropy. After training, you discard the classifier and keep the encoder, which now produces semantically meaningful sentence vectors.
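The classification objective can be sketched in a few lines. This is a forward pass only, with hypothetical all-zero weights and toy 2-dim vectors; the label order in `LABELS` is an assumption, and a real setup would backpropagate this loss through both towers.

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]  # index order is an assumption

def features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nli_loss(u, v, weights, label):
    """Cross-entropy of a linear softmax classifier over [u, v, |u-v|].

    weights holds one row of length 3 * dim per label.
    """
    x = features(u, v)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    return -math.log(probs[label]), probs

u = [0.5, -1.0]
v = [0.5, 1.0]
weights = [[0.0] * 6 for _ in LABELS]  # hypothetical untrained weights
loss, probs = nli_loss(u, v, weights, label=0)
# With all-zero weights the classifier is uniform, so loss equals log(3).
```

The |u-v| term gives the classifier direct access to per-dimension distance, which is presumably why the trained space ends up friendly to distance-based comparison.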
Walkthrough
The 10K-sentence retrieval problem:
SETUP:
    query  = "How tall is the Eiffel Tower?"
    corpus = 10,000 paragraphs
CROSS-ENCODER (vanilla BERT):
    scores = []
    for each candidate c in corpus:
        scores.append(BERT(query, c))        # one full forward pass per pair
    return top_k(scores)
    Cost: 10,000 BERT forward passes * ~25ms = ~250 seconds per query.
    For 1M sentences: ~7 hours per query. Untenable.
BI-ENCODER (Sentence-BERT):
    q_vec = SBERT(query)                     # ONE forward pass total
    scores = []
    for each candidate vector c_vec in the precomputed index:
        scores.append(cosine(q_vec, c_vec))  # c_vec was computed offline
    return top_k(scores)
    Cost (online): one forward pass + 10K cosines = ~50ms per query.
    For 1M sentences with FAISS: ~10ms per query for the index lookup.
    ~5000x faster than the cross-encoder at 10K sentences (250 s vs 50 ms).
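The online half of the bi-encoder path reduces to a nearest-neighbour scan over stored vectors. A minimal sketch, assuming the corpus embeddings already exist; the toy 2-dim vectors and the names `build_index` and `search` are hypothetical stand-ins for an encoder plus a real vector index such as FAISS.

```python
import heapq
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def build_index(corpus_vectors):
    """Offline step: L2-normalize once so cosine reduces to a dot product."""
    return [normalize(v) for v in corpus_vectors]

def search(query_vector, index, k):
    """Online step: score every stored vector, return the k best (score, id)."""
    q = normalize(query_vector)
    scored = ((sum(a * b for a, b in zip(q, c)), i) for i, c in enumerate(index))
    return heapq.nlargest(k, scored)

# Toy embeddings standing in for encoder outputs (hypothetical values).
corpus_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
index = build_index(corpus_vectors)

top = search([2.0, 0.1], index, k=2)
top_ids = [i for _, i in top]  # [0, 2]
```

The scan is still linear in corpus size; approximate indexes trade a little recall for sub-linear lookup, which is what keeps million-scale search in the millisecond range.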
Performance on STS-Benchmark (Spearman correlation, higher is better):
| Method | Spearman |
|---|---|
| Avg GloVe vectors | 58.0 |
| BERT [CLS] embedding (no fine-tune) | 16.5 (far worse than GloVe!) |
| BERT mean pool (no fine-tune) | 46.4 |
| SBERT (NLI fine-tuned) | 79.2 |
| Cross-encoder BERT (paired input) | 86.5 |
SBERT closes most of the gap to the cross-encoder, while running ~5000x faster. The remaining 7-point gap is what motivates the modern hybrid pattern: use a bi-encoder to fetch the top-100, then a cross-encoder to re-rank.
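The two-stage pattern can be sketched with the two models passed in as plain scoring functions. Everything here is a toy under stated assumptions: `toy_bi_score` and `toy_cross_score` are hypothetical word-overlap stand-ins for neural models, and `first_k`/`final_k` are illustrative parameter names.

```python
def retrieve_then_rerank(query, corpus, bi_score, cross_score, first_k=100, final_k=10):
    """Stage 1: cheap score over the whole corpus.
    Stage 2: expensive score over the shortlist only."""
    shortlist = sorted(corpus, key=lambda doc: bi_score(query, doc), reverse=True)[:first_k]
    return sorted(shortlist, key=lambda doc: cross_score(query, doc), reverse=True)[:final_k]

def toy_bi_score(query, doc):
    # Coarse signal: number of shared words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

cross_calls = []  # records how often the expensive scorer runs

def toy_cross_score(query, doc):
    # Finer signal: shared words relative to document length.
    cross_calls.append(doc)
    return toy_bi_score(query, doc) / len(doc.split())

corpus = [
    "the eiffel tower is 330 metres tall",
    "how tall is a giraffe",
    "the tower of london is old",
    "paris has many museums",
]
result = retrieve_then_rerank(
    "how tall is the eiffel tower", corpus,
    toy_bi_score, toy_cross_score, first_k=3, final_k=1,
)
# result holds the Eiffel Tower sentence; the expensive scorer ran 3 times, not 4.
```

The point of the split is the call count: the expensive scorer's cost is bounded by `first_k`, independent of corpus size.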
What’s clever — find the instinct
The first clever recognition: pretrained BERT embeddings without fine-tuning are terrible for similarity. The [CLS] token vector is supposed to be a sentence summary, but in practice it scores worse than averaging GloVe word embeddings — it has been trained for masked-language-model and next-sentence-prediction objectives, not for similarity. You can’t just plug pretrained BERT into a vector index and get good results.
“We found that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity.”
The second clever move: use NLI as the fine-tuning signal for similarity. SNLI and MultiNLI together provide roughly 1M labeled sentence pairs, and the entailment/contradiction labels carry the structure you want in vector space: sentences where one entails the other should sit near each other, and contradictory ones should sit far apart. Fine-tuning a siamese network on NLI gives the embedding space the structure that cosine similarity needs.
“We use NLI data to fine-tune SBERT to produce sentence embeddings.”
The third clever move: mean pooling over all tokens, not just [CLS]. The [CLS] token was trained for next-sentence-prediction, not for capturing semantic meaning of the sentence. Mean pooling forces every token to contribute, producing a more uniformly informative vector.
“Pooling-strategy […] MEAN-strategy works the best.”
Does it work? What breaks?
Speed comparison (10K sentences, finding the most similar pair):
| Method | Time |
|---|---|
| BERT cross-encoder | ~65 hours |
| InferSent | ~6 seconds |
| Universal Sentence Encoder | ~5 seconds |
| Sentence-BERT | ~5 seconds |
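
The 65-hour figure follows from the quadratic number of pairs. A quick back-of-envelope check, taking the 65 hours from the table above as given; the implied per-pair time is an average, so it presumably reflects batched GPU throughput rather than single-pair latency.

```python
from math import comb

n_sentences = 10_000
pairs = comb(n_sentences, 2)       # n * (n - 1) / 2 unordered pairs
cross_encoder_hours = 65           # figure taken from the table above
seconds_per_pair = cross_encoder_hours * 3600 / pairs

bi_encoder_passes = n_sentences    # one encoder pass per sentence

print(pairs)                              # 49995000
print(round(seconds_per_pair * 1000, 2))  # 4.68 (implied ms per pair)
print(pairs // bi_encoder_passes)         # 4999 (ratio of encoder passes)
```

The bi-encoder still compares all ~50M pairs, but each comparison is a cosine between cached vectors instead of a transformer forward pass.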
Accuracy on transfer tasks (SentEval benchmark, average score):
| Method | Avg |
|---|---|
| GloVe | 81.5 |
| InferSent | 85.6 |
| SBERT | 87.7 |
SBERT beats prior fast-retrieval methods (InferSent, USE) on accuracy and matches them on speed.
What breaks:
- Cross-encoder still wins on accuracy when speed isn’t a constraint. For re-ranking the top-K from a bi-encoder, a cross-encoder gives 5-10 points more on hard tasks.
- Out-of-domain transfer is uneven. SBERT trained on NLI works well for general semantic similarity but may need domain-specific fine-tuning for legal, medical, or code retrieval.
- Sentence length sensitivity. Mean pooling can be biased by sentence length; very short sentences sometimes embed strangely. Modern variants use attention pooling.
- Bias in NLI data. SBERT inherits whatever biases are in SNLI and MultiNLI: both are English-only, with a particular topic and genre distribution.
So what?
Sentence-BERT is the foundational architecture behind most modern dense-retrieval systems. The bi-encoder pattern (encode independently, compare via cosine, index with FAISS or a similar vector index) is the backbone of:
- RAG systems: retrieve relevant chunks, then condition the LLM on them.
- Semantic search: replacing BM25 in user-facing search.
- Deduplication: SBERT vector similarity is a fast clusterer.
- Recommendation: text-based content similarity.
- Multilingual cross-lingual retrieval: train SBERT on aligned multilingual pairs.
For Saikat’s work on POI dedup and Indonesian address normalization, SBERT is the canonical retrieval backbone. The pattern:
- Address normalization SaaS: encode every input address with a domain-fine-tuned SBERT; index canonical addresses; lookup via cosine. “Jl Sudirman No 23” and “Jalan Sudirman 23” land near each other; “Sudirman 23” maps to the same canonical record.
- POI dedup: encode POI name + address text; cluster by cosine. Combine with the CatBoost geometric features for the final dedup decision.
- Trajectory-with-text: encode the route’s stop names; combine with t2vec trajectory embedding.
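The "cluster by cosine" step for dedup can be sketched as greedy threshold clustering. The 3-dim vectors below are hand-made placeholders for what a fine-tuned encoder might emit, and the 0.9 threshold and the name `greedy_dedup` are assumptions; a production pipeline would tune the threshold and feed the result into the downstream classifier.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_dedup(names, vectors, threshold=0.9):
    """Put each item in the first cluster whose representative is similar
    enough; otherwise start a new cluster with the item as representative."""
    clusters = []  # list of (representative_vector, member_names)
    for name, vec in zip(names, vectors):
        for rep, members in clusters:
            if cosine(rep, vec) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]

names = ["Jl Sudirman No 23", "Jalan Sudirman 23", "Jl Thamrin 5"]
vectors = [  # hypothetical embeddings, not real model output
    [0.90, 0.10, 0.00],
    [0.88, 0.15, 0.02],
    [0.10, 0.90, 0.20],
]
groups = greedy_dedup(names, vectors)
# The two Sudirman spellings share a cluster; Thamrin stands alone.
```

Greedy assignment depends on input order and compares only against the first member of each cluster, so it is a cheap first pass rather than a final clustering.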
The deeper principle: the bi-encoder/cross-encoder split is a fundamental design choice in retrieval. Bi-encoder is fast and offline-indexable but loses fine-grained interaction. Cross-encoder is precise but linear in candidate count. The modern compromise — bi-encoder for the first stage, cross-encoder for re-ranking — was made viable by SBERT making the bi-encoder accuracy-competitive in the first place. ColBERT later finds a middle ground (late interaction).
“We present Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity.”
For the practitioner: when in doubt, start with an SBERT-style bi-encoder. The sentence-transformers library, started by the paper's authors and now maintained under Hugging Face, is the canonical implementation. Modern descendants (BGE, GTE, E5, Mistral-Embed) follow the same encode-then-compare pattern and generally score higher on retrieval benchmarks.
Connections
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding — SBERT uses pretrained BERT as its encoder
- rag-retrieval-augmented-generation — SBERT-style retrievers are the default RAG backbone
- sentence-embeddings — SBERT is the foundational paper for the concept
- siamese-networks — the architecture pattern
- contrastive-learning — SBERT's triplet objective is contrastive; the NLI softmax objective is a closely related supervised signal
- bi-encoder — SBERT popularized the bi-encoder alternative to BERT-style cross-encoders in modern NLP
- semantic-similarity — the canonical evaluation domain
Citation
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019. https://arxiv.org/abs/1908.10084