Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Concepts: sentence-embeddings | siamese-networks | contrastive-learning | bi-encoder | semantic-similarity
Builds on: bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding (uses pretrained BERT as the encoder)
Leads to: rag-retrieval-augmented-generation (SBERT-style bi-encoders are the default retrieval backbone in RAG systems)

BERT is an excellent scorer for tasks where two sentences are processed together: feed both as one input separated by a [SEP] token, run a single forward pass, and read out the similarity or relevance score. That cross-encoder setup scales badly for search and clustering, because every comparison needs its own forward pass. Finding the most similar pair in a collection of 10,000 sentences takes n(n-1)/2, roughly 50 million, forward passes, which the paper estimates at about 65 hours on a V100 GPU. Sentence-BERT (Reimers & Gurevych, EMNLP 2019) reorganizes the architecture: encode each sentence independently into a fixed-length vector, then compare vectors with cosine similarity. The same 10,000-sentence task drops to about 5 seconds, with a small accuracy cost relative to the cross-encoder.
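
As a quick check on the pair count behind the 65-hour figure: finding the most similar pair among 10,000 sentences means scoring every unordered pair once. The sketch below does that arithmetic. The per-pair latency is derived from the 65-hour total rather than measured, and the variable names are illustrative.

```python
# Number of unordered sentence pairs a cross-encoder must score
# to find the most similar pair among n sentences.
n = 10_000
pairs = n * (n - 1) // 2

# Per-pair latency implied by a ~65-hour total
# (derived for illustration, not a measured number).
total_seconds = 65 * 3600
ms_per_pair = 1000 * total_seconds / pairs

print(pairs)                  # 49995000
print(round(ms_per_pair, 2))  # 4.68
```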

The core idea

The analogy: A cross-encoder (BERT-style) is like reading two essays side by side to compare them — slow but careful, you notice every interaction. A bi-encoder (Sentence-BERT-style) is like writing a one-paragraph summary of each essay separately, then comparing summaries — much faster, slightly lossy, but the summaries are reusable.

The architecture:

SENTENCE A  ----[BERT]---->  pool  ----> u (768-dim vector)
                                                       \
                                                cosine
                                                       /
SENTENCE B  ----[BERT]---->  pool  ----> v (768-dim vector)

The two BERT towers share weights (siamese). The pooling step takes the per-token output and reduces it to a fixed vector — the paper finds mean pooling over all token outputs works best, beating the [CLS] token and max pooling. Then cosine similarity (or any vector distance) on (u, v) gives the relevance score.
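
A minimal sketch of the pool-then-compare step, in plain Python with toy 3-dimensional token outputs standing in for BERT's 768-dimensional ones. The attention mask that skips padding tokens is an assumption added here (the text above does not discuss padding), and the function names are illustrative.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (real tokens, not padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy per-token outputs; the last row of sentence A is padding (mask 0).
tokens_a = [[1.0, 0.0, 1.0], [3.0, 2.0, 1.0], [9.0, 9.0, 9.0]]
mask_a = [1, 1, 0]
tokens_b = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
mask_b = [1, 1]

u = mean_pool(tokens_a, mask_a)  # [2.0, 1.0, 1.0]
v = mean_pool(tokens_b, mask_b)  # [2.0, 1.0, 1.0]
score = cosine(u, v)             # close to 1.0: the pooled vectors coincide
```

Because both sentences go through the same pooling function (and, in the real model, the same encoder weights), the vectors live in one shared space and can be compared directly.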

Training: fine-tune the entire siamese stack on Natural Language Inference (NLI) pairs. Each pair carries one of three labels: entailment (the premise implies the hypothesis), neutral, or contradiction. The classification objective is a softmax classifier over the concatenation [u, v, |u-v|], trained with cross-entropy loss. After training, you discard the classifier; the encoder produces semantically meaningful sentence vectors.
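
The classification objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' training code: the embeddings are toy 2-dimensional vectors, the weight matrix W is hypothetical, and the helper names are made up for this example.

```python
import math

def nli_features(u, v):
    """Concatenate u, v and the element-wise |u - v|, giving a 3n-dim vector."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]

def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nli_loss(u, v, weights, label):
    """Cross-entropy of softmax(W . [u, v, |u-v|]) against the gold label index."""
    feats = nli_features(u, v)
    logits = [sum(w * f for w, f in zip(row, feats)) for row in weights]
    probs = softmax(logits)
    return -math.log(probs[label]), probs

# Toy embeddings and a hypothetical 3 x 6 weight matrix
# (rows: entailment, neutral, contradiction).
u, v = [0.5, 1.0], [0.5, 0.0]
W = [
    [0.1, 0.1, 0.1, 0.1, -1.0, -1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
]
feats = nli_features(u, v)  # [0.5, 1.0, 0.5, 0.0, 0.0, 1.0]
loss, probs = nli_loss(u, v, W, label=2)
```

In real training the gradient of this loss flows back through both towers, which is what reshapes the embedding space; the weight matrix itself is thrown away afterwards.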

Walkthrough

The 10K-sentence retrieval problem:

SETUP: query = "How tall is the Eiffel Tower?"
       corpus = 10,000 paragraphs.

CROSS-ENCODER (vanilla BERT):
  for each candidate c in corpus:
      scores[c] = BERT(query, c)    # one full forward pass per pair
  return top_k(scores)

  Cost: 10,000 BERT forward passes * ~25ms = ~250 seconds per query
        (the ~25ms per pass is an illustrative latency).
        For 1M sentences: ~7 hours per query. Untenable.

BI-ENCODER (Sentence-BERT):
  q_vec = SBERT(query)               # ONE forward pass total
  for each candidate c in corpus:
      scores[c] = cosine(q_vec, c_vec[c])   # c_vec was precomputed offline
  return top_k(scores)

  Cost (online): one forward pass + 10K cosines = ~50ms per query,
                 ~5000x faster than the cross-encoder on the same corpus.
                 For 1M sentences with a FAISS index: ~10ms per query.
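
The offline/online split in the bi-encoder walkthrough can be sketched as follows. The `toy_encode` function is a hypothetical stand-in for SBERT (a unit-length bag-of-words vector over a tiny hand-picked vocabulary), chosen only so the example runs without a model; the corpus and vocabulary are made up.

```python
import math

def toy_encode(text, vocab):
    """Stand-in for SBERT: a unit-length bag-of-words vector over `vocab`."""
    words = text.lower().replace("?", "").replace(".", "").split()
    vec = [float(words.count(w)) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def top_k(query_vec, index, k):
    """Rank precomputed vectors by dot product (equal to cosine for unit vectors)."""
    scored = [(sum(q * c for q, c in zip(query_vec, vec)), doc_id)
              for doc_id, vec in enumerate(index)]
    return sorted(scored, reverse=True)[:k]

corpus = [
    "The Eiffel Tower is 330 metres tall.",
    "Paris is the capital of France.",
    "Mount Everest is the tallest mountain.",
]
vocab = ["eiffel", "tower", "tall", "paris", "capital", "france", "everest", "mountain"]

# Offline: encode every document once and store the vectors.
index = [toy_encode(doc, vocab) for doc in corpus]

# Online: one encoder call for the query, then cheap vector comparisons.
query_vec = toy_encode("How tall is the Eiffel Tower?", vocab)
results = top_k(query_vec, index, k=2)
best_doc = corpus[results[0][1]]
```

The expensive part (encoding the corpus) happens once; each new query costs one encoder call plus dot products, which is where the speedup comes from.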

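Performance on STS-Benchmark (Spearman rank correlation scaled by 100, higher is better), as reported in the paper:

Method	Spearman
Avg GloVe vectors	58.0
BERT [CLS] embedding (no fine-tune)	16.5 (far worse than GloVe!)
BERT mean pool (no fine-tune)	46.4
SBERT-large (NLI fine-tuned, no STS training data)	79.2
Cross-encoder BERT-large (paired input, fine-tuned on STSb)	85.6

SBERT closes most of the gap to the cross-encoder while being orders of magnitude faster at query time. The cross-encoder row was fine-tuned on STSb itself and the SBERT row was not, so the roughly 6-point difference is not a like-for-like comparison. The remaining gap is what motivates the modern hybrid pattern: use a bi-encoder to fetch the top 100, then a cross-encoder to re-rank them.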
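
The hybrid pattern can be sketched as a two-stage function. Both scorers here are hypothetical placeholders (simple word-overlap heuristics) so the example runs without any model; in practice the first would be a bi-encoder similarity over precomputed vectors and the second a cross-encoder forward pass.

```python
def retrieve_then_rerank(query, corpus, bi_score, cross_score, k):
    """Two-stage search: cheap shortlist first, expensive re-ranking second.

    bi_score(query, doc) and cross_score(query, doc) are caller-supplied
    scoring functions; both are placeholders for real models.
    """
    # Stage 1: score every document with the fast scorer and keep the top k.
    shortlist = sorted(corpus, key=lambda doc: bi_score(query, doc), reverse=True)[:k]
    # Stage 2: run the slow, more accurate scorer on the shortlist only.
    return sorted(shortlist, key=lambda doc: cross_score(query, doc), reverse=True)

# Toy scorers so the sketch runs without any model.
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def exact_phrase_bonus(query, doc):
    return word_overlap(query, doc) + (5 if query.lower() in doc.lower() else 0)

docs = [
    "tower height eiffel",
    "the eiffel tower height is 330 metres",
    "height of mount everest",
    "paris travel guide",
]
ranked = retrieve_then_rerank("eiffel tower height", docs, word_overlap, exact_phrase_bonus, k=2)
```

The first stage cannot tell the two Eiffel documents apart (same word overlap), while the second stage, which looks at the query and document together, moves the better match to the top. The slow scorer runs only k times instead of once per document.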

What’s clever — find the instinct

The first clever recognition: pretrained BERT embeddings without fine-tuning are terrible for similarity. The [CLS] token vector is supposed to be a sentence summary, but in practice it scores worse than averaging GloVe word embeddings — it has been trained for masked-language-model and next-sentence-prediction objectives, not for similarity. You can’t just plug pretrained BERT into a vector index and get good results.

“We found that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity.”

The second clever move: use NLI as the fine-tuning signal for similarity. SNLI and MultiNLI together provide roughly 1M labeled sentence pairs, and the entailment/contradiction labels line up well with what you want to encode in vector space: sentences that entail each other should be near each other, and contradictory ones should be far apart. Fine-tuning a siamese network on NLI gives the embedding space the structure that cosine similarity needs.

“We use NLI data to fine-tune SBERT to produce sentence embeddings.”

The third clever move: mean pooling over all tokens, not just [CLS]. The [CLS] token was trained for next-sentence-prediction, not for capturing semantic meaning of the sentence. Mean pooling forces every token to contribute, producing a more uniformly informative vector.

“Pooling-strategy […] MEAN-strategy works the best.”
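
To make the three pooling strategies concrete, here is a small sketch on toy 2-dimensional token outputs. The `pool` helper is illustrative and ignores padding for brevity; row 0 plays the role of the [CLS] token.

```python
def pool(token_vectors, strategy):
    """Reduce a list of per-token vectors to one fixed-size vector."""
    if strategy == "cls":
        return list(token_vectors[0])          # first token's output only
    columns = list(zip(*token_vectors))        # one tuple per dimension
    if strategy == "max":
        return [max(col) for col in columns]   # element-wise maximum over tokens
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown strategy: {strategy}")

tokens = [[0.0, 4.0], [2.0, 0.0], [4.0, 2.0]]  # toy outputs for three tokens
cls_vec = pool(tokens, "cls")    # [0.0, 4.0]
max_vec = pool(tokens, "max")    # [4.0, 4.0]
mean_vec = pool(tokens, "mean")  # [2.0, 2.0]
```

Only the mean strategy lets every token shift the result in proportion to its value, which is one intuition for why it gives a more uniformly informative vector.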

Does it work? What breaks?

Speed comparison (10K sentences, finding the most similar pair):

Method	Time
BERT cross-encoder	~65 hours
InferSent	~6 seconds
Universal Sentence Encoder	~5 seconds
Sentence-BERT	~5 seconds

Accuracy on transfer tasks (SentEval benchmark, average score):

Method	Avg
Avg GloVe vectors	81.5
InferSent	85.6
Universal Sentence Encoder	85.1
SBERT-NLI-base	87.4

SBERT beats prior fast-retrieval methods (InferSent, USE) on accuracy and matches them on speed.

What breaks:

Cross-encoder still wins on accuracy when speed isn’t a constraint. For re-ranking the top-K from a bi-encoder, a cross-encoder typically adds several points on hard tasks.
Out-of-domain transfer is uneven. SBERT trained on NLI works well for general semantic similarity but may need domain-specific fine-tuning for legal, medical, or code retrieval.
Sentence length sensitivity. Mean pooling can be biased by sentence length; very short sentences sometimes embed strangely. Some later models use other pooling schemes, such as attention pooling.
Bias in NLI data. SBERT inherits whatever biases are in SNLI and MultiNLI: both are English-only and have a particular topic distribution.

So what?

Sentence-BERT is the foundational architecture for much of modern dense retrieval. The bi-encoder pattern (encode independently, compare via cosine, index with a library such as FAISS) is the backbone of:

RAG systems: retrieve relevant chunks, then condition the LLM on them.
Semantic search: replacing BM25 in user-facing search.
Deduplication: SBERT vector similarity is a fast clusterer.
Recommendation: text-based content similarity.
Multilingual cross-lingual retrieval: train SBERT on aligned multilingual pairs.

For Saikat’s work on POI dedup and Indonesian address normalization, SBERT is the canonical retrieval backbone. The pattern:

Address normalization SaaS: encode every input address with a domain-fine-tuned SBERT; index canonical addresses; lookup via cosine. “Jl Sudirman No 23” and “Jalan Sudirman 23” land near each other; “Sudirman 23” maps to the same canonical record.
POI dedup: encode POI name + address text; cluster by cosine. Combine with the CatBoost geometric features for the final dedup decision.
Trajectory-with-text: encode the route’s stop names; combine with t2vec trajectory embedding.
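
The "cluster by cosine" step for dedup can be sketched as single-link clustering with a similarity threshold. The embeddings below are hypothetical hand-written vectors rather than real SBERT outputs, the 0.95 threshold is an assumed value that would need tuning, and the pairwise loop is quadratic, so a real pipeline would likely use an approximate nearest-neighbour index to find candidate pairs.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_by_threshold(vectors, threshold):
    """Group items whose cosine similarity is >= threshold (single-link, union-find)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge j's cluster into i's

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical embeddings for three records: the first two are
# near-duplicates, the third is a different place.
vectors = [[0.9, 0.1, 0.0], [0.88, 0.12, 0.01], [0.0, 0.2, 0.95]]
clusters = cluster_by_threshold(vectors, threshold=0.95)
print(clusters)  # [[0, 1], [2]]
```

Each resulting cluster is a candidate duplicate group; the final merge decision can still be handed to a downstream classifier that also sees non-text features.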

The deeper principle: the bi-encoder/cross-encoder split is a fundamental design choice in retrieval. A bi-encoder is fast and its document vectors can be indexed offline, but it loses fine-grained token-level interaction between query and document. A cross-encoder is more precise but needs one full forward pass per candidate. The modern compromise (bi-encoder for the first stage, cross-encoder for re-ranking) was made viable by SBERT making the bi-encoder accuracy-competitive in the first place. ColBERT later finds a middle ground (late interaction).
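
For intuition on the late-interaction middle ground, here is a rough sketch of a MaxSim-style score: keep one vector per token instead of one per sentence, match each query token to its best document token, and sum. The token vectors are toy values and the function name is illustrative; this is a simplified picture, not ColBERT's implementation.

```python
def maxsim_score(query_tokens, doc_tokens):
    """For each query token vector take the max dot product over
    document token vectors, then sum over query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )

query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
score = maxsim_score(query_tokens, doc_tokens)  # 0.8 + 1.0
```

Document token vectors can still be precomputed offline, as with a bi-encoder, while the per-token matching recovers some of the interaction a cross-encoder sees.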

“We present Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity.”

For the practitioner: when in doubt, start with an SBERT-style bi-encoder. The Hugging Face sentence-transformers library is the canonical implementation. Modern descendants (BGE, GTE, E5, Mistral-Embed) follow the same bi-encoder recipe and can usually be swapped in for better accuracy.

Connections

bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding: SBERT uses pretrained BERT as its encoder
rag-retrieval-augmented-generation: SBERT-style retrievers are the default RAG backbone
sentence-embeddings: SBERT is the foundational paper for the concept
siamese-networks: the architecture pattern
contrastive-learning: the paper’s triplet objective is a contrastive-style loss; the NLI softmax objective is a classification loss rather than a strictly contrastive one
bi-encoder: SBERT popularized the bi-encoder approach for BERT-based sentence embeddings, in contrast to cross-encoder scoring
semantic-similarity: the canonical evaluation domain

Citation

arXiv:1908.10084

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019. https://arxiv.org/abs/1908.10084
