The Problem
Word embeddings (Word2Vec, GloVe) gave us one vector per word. But most useful comparisons happen at the sentence or paragraph level: “is this query about the same topic as this passage?”, “is this customer review a complaint?”, “do these two product descriptions match?”. The naive fix, averaging the word vectors, destroys word order and ignores syntactic structure (“the dog bit the man” averages to the same thing as “the man bit the dog”). For BERT-era systems, you could feed two sentences together through a cross-encoder and read out a similarity score, but scoring one query against N candidates takes N BERT forward passes, and comparing every pair in a collection takes O(N²) passes, which is useless at scale. The problem: how to compute meaningful sentence-level vectors that work as standalone, comparable, indexable units.
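The word-order problem with averaging can be seen in a minimal sketch. The 3-dimensional vectors below are made up for illustration (they are not real Word2Vec or GloVe values), and `average_embedding` is a hypothetical helper name.

```python
import numpy as np

# Toy 3-d word vectors; the values are invented for illustration only.
toy_vectors = {
    "the": np.array([0.1, 0.0, 0.2]),
    "dog": np.array([0.9, 0.3, 0.1]),
    "bit": np.array([0.2, 0.8, 0.5]),
    "man": np.array([0.7, 0.1, 0.6]),
}

def average_embedding(sentence: str) -> np.ndarray:
    """Average the word vectors of a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    return np.mean([toy_vectors[w] for w in words], axis=0)

a = average_embedding("the dog bit the man")
b = average_embedding("the man bit the dog")

# Same bag of words, so the averages coincide even though the meanings differ.
print(np.allclose(a, b))
```

Any order-insensitive pooling of static word vectors has this property, which is why the sections below rely on a contextual encoder before pooling.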
The Key Insight
You can fine-tune a transformer encoder to produce sentence vectors directly — not as a byproduct of next-token prediction or masked-LM, but as the explicit goal — by training on pairs that should or shouldn’t be similar. This produces a bi-encoder: each sentence gets one fixed-dim vector, and similarity is just cosine. The architecture’s expressiveness comes from BERT-scale pretraining; the similarity geometry comes from contrastive fine-tuning.
Mechanism in Plain English
- Start with a pretrained transformer (BERT, RoBERTa, DistilBERT, or modern decoder-encoder hybrids).
- Add a pooling step: mean-pool the token outputs (preferred), take the [CLS] token (worse), or attention-pool (best for some tasks).
- Fine-tune on labeled or weakly-supervised pairs: training pulls the embeddings of similar pairs together and pushes the embeddings of dissimilar pairs apart. The standard loss is contrastive (InfoNCE) with in-batch negatives.
- At inference: encode each sentence once into a vector. Store in a vector index (FAISS, Annoy, Vespa). For any query, encode once and retrieve via approximate nearest neighbor.
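The pooling step in the list above can be sketched in NumPy. This is a rough illustration under stated assumptions: the token vectors are random stand-ins for real encoder outputs, and `mean_pool` and `l2_normalize` are hypothetical helper names, not functions from any particular library.

```python
import numpy as np

def mean_pool(token_vecs: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    token_vecs:     (batch, seq_len, dim) encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_vecs.dtype)  # (batch, seq_len, 1)
    summed = (token_vecs * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)              # avoid divide-by-zero
    return summed / counts

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so that dot product equals cosine."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
token_vecs = rng.normal(size=(2, 5, 8))        # stand-in for encoder outputs
attention_mask = np.array([[1, 1, 1, 0, 0],    # sentence 1: 3 real tokens
                           [1, 1, 1, 1, 1]])   # sentence 2: 5 real tokens

sentence_vecs = l2_normalize(mean_pool(token_vecs, attention_mask))
print(sentence_vecs.shape)  # (2, 8)
```

Masking matters because padded positions would otherwise drag the average toward whatever the encoder emits for padding tokens.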
ASCII Diagram
  SENTENCE A            SENTENCE B
      |                     |
[BERT/encoder]        [BERT/encoder]     (shared weights = "siamese")
      |                     |
 [mean pool]           [mean pool]
      |                     |
   u (768d)              v (768d)
       \                   /
        \                 /
            cosine sim            <-- training target = entailment label / contrastive
Math with Translation
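The contrastive (InfoNCE) loss with in-batch negatives, for a batch of $N$ (query, positive) pairs and cosine similarity $\mathrm{sim}$:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(u_i, v_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(u_i, v_j)/\tau\right)}$$

- $u_i$ = encoder output for query $q_i$.
- $v_i$ = encoder output for the positive (paired) document.
- $v_j$ for $j \neq i$ = in-batch negatives (other documents in the same batch).
- $\tau$ = temperature, typically 0.02-0.07.
- The numerator scores the true positive; the denominator includes all in-batch comparisons.
The model learns to make $\mathrm{sim}(u_i, v_i)$ much larger than $\mathrm{sim}(u_i, v_j)$ for $j \neq i$, i.e., to pull positives close and push negatives apart in the cosine geometry.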
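A minimal NumPy sketch of this loss follows, assuming cosine similarity and a default temperature of 0.05 (an assumed value inside the range quoted above). The function name `info_nce` and the random test vectors are illustrative, not taken from any library. Row i of `v` is treated as the positive for row i of `u`, and every other row of `v` acts as an in-batch negative.

```python
import numpy as np

def info_nce(u: np.ndarray, v: np.ndarray, tau: float = 0.05) -> float:
    """In-batch-negatives InfoNCE. u, v: (N, dim); v[i] is the positive for u[i]."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = (u @ v.T) / tau                              # (N, N) cosine / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # mean over the N rows

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 16))                     # stand-in passage vectors
aligned = v + 0.01 * rng.normal(size=v.shape)    # queries close to their positives
shuffled = aligned[::-1]                         # queries paired with the wrong passages

# Correctly paired batches should score a lower loss than mispaired ones.
print(info_nce(aligned, v) < info_nce(shuffled, v))
```

In a real training loop the same computation would run inside an autodiff framework so that gradients flow back into the encoder; the NumPy version only shows the forward arithmetic.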
Concrete Walkthrough
TRAINING BATCH (size 4):
q1 = "How tall is the Eiffel Tower?" p1 = "The Eiffel Tower is 330m tall."
q2 = "What is the capital of France?" p2 = "Paris is the capital of France."
q3 = "When was the Eiffel Tower built?" p3 = "Construction began in 1887."
q4 = "How many people visit annually?" p4 = "About 7 million tourists visit per year."
ENCODE EACH:
u1, u2, u3, u4 (queries, 4 vectors of 768d)
v1, v2, v3, v4 (passages, 4 vectors of 768d)
DOT PRODUCT MATRIX (4x4):
    v1    v2    v3    v4
u1 [0.81  0.42  0.65  0.31]   <- u1 should match v1 (highest)
u2 [0.38  0.78  0.41  0.29]   <- u2 should match v2
u3 [0.59  0.45  0.84  0.33]   <- u3 should match v3
u4 [0.32  0.34  0.39  0.79]   <- u4 should match v4
LOSS: cross-entropy over each row -> gradient pulls diagonal up,
pushes off-diagonal down.
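To make the loss step concrete, the sketch below applies a row-wise softmax to the 4x4 matrix from the walkthrough and averages the negative log-probability of the diagonal. The two temperatures are assumptions chosen for contrast: 1.0 means no scaling, and 0.05 sits inside the typical range. `row_softmax` is a hypothetical helper.

```python
import numpy as np

# Similarity matrix copied from the walkthrough (rows: queries, columns: passages).
sim = np.array([
    [0.81, 0.42, 0.65, 0.31],
    [0.38, 0.78, 0.41, 0.29],
    [0.59, 0.45, 0.84, 0.33],
    [0.32, 0.34, 0.39, 0.79],
])

def row_softmax(x: np.ndarray) -> np.ndarray:
    """Softmax over each row, shifted by the row max for stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

losses = {}
for tau in (1.0, 0.05):  # assumed temperatures, chosen only for comparison
    probs = row_softmax(sim / tau)
    correct = np.diag(probs)                  # probability given to the true passage
    losses[tau] = float(-np.log(correct).mean())
    print(f"tau={tau}: P(correct)={np.round(correct, 3)}, loss={losses[tau]:.3f}")
```

A small temperature stretches the gaps between scores, so the softmax puts most of its mass on the diagonal even though the raw scores differ by only a few tenths.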
What’s Clever
The clever move is decoupling encoding from comparison. With a cross-encoder, encoding and comparison happen jointly, so every (query, doc) pair needs its own forward pass. With sentence embeddings, encoding happens once per sentence (offline), and comparison reduces to a vector dot product (online). This can give a speedup on the order of 1000-10000x at a modest accuracy cost, which is the right trade for many production retrieval systems.
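The offline/online split can be sketched with a brute-force index. This is an illustration under assumptions: the document vectors are random stand-ins for real encoder outputs, `search` is a hypothetical helper, and a production system would typically swap the exact scan for an approximate nearest-neighbor index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline: pretend these are encoder outputs for 10,000 documents (random stand-ins).
doc_vecs = rng.normal(size=(10_000, 64)).astype(np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def search(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k highest-cosine documents (exact, brute force)."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = index @ query_vec                  # one matrix-vector product
    top_k = np.argpartition(-scores, k)[:k]     # unordered top k
    return top_k[np.argsort(-scores[top_k])]    # sorted by descending score

# Online: a query that is a slightly noisy copy of document 123.
query = doc_vecs[123] + 0.05 * rng.normal(size=64).astype(np.float32)
hits = search(query, doc_vecs)
print(hits[0])
```

The per-query cost here is one encoder pass plus a single matrix-vector product, whereas a cross-encoder would need one transformer pass for each of the 10,000 candidates.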
The second clever recognition: off-the-shelf BERT embeddings are bad for similarity. Pretrained BERT was optimized for masked language modeling, not for cosine geometry, so it has no incentive to produce well-spaced vectors. The fine-tuning step is what turns BERT into a sentence encoder. Skipping it can give you embeddings that score worse than averaged GloVe on semantic-similarity benchmarks.
Code
# Using sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('BAAI/bge-large-en')
queries = ["How tall is the Eiffel Tower?", "Capital of France?"]
docs = ["The Eiffel Tower is 330m tall.", "Paris is the capital."]
q_vecs = model.encode(queries, normalize_embeddings=True) # (2, 1024)
d_vecs = model.encode(docs, normalize_embeddings=True) # (2, 1024)
# Cosine similarity matrix (since both are L2-normalized, dot = cosine)
sim = q_vecs @ d_vecs.T # (2, 2)
print(sim)
# [[0.82 0.31]
#  [0.34 0.78]]
Key Sources
- sentence-bert-siamese-bert-networks — the foundational paper
- colbert-late-interaction-retrieval — keeps per-token vectors, recovers cross-encoder accuracy
- bge-c-pack-general-chinese-embeddings — modern open-source SOTA
- mteb-massive-text-embedding-benchmark — the canonical evaluation
- word2vec-efficient-estimation-word-representations — the conceptual ancestor; per-word version
Related Concepts
- contrastive-learning — the loss used to train sentence embeddings
- bi-encoder — the architecture pattern
- semantic-similarity — the canonical task
- multimodal-embeddings — the cross-modal generalization (CLIP, SigLIP)
- word-embeddings — sentence embeddings extend the idea from words
Open Questions
- Long-context: most sentence encoders cap at 512 tokens. How to embed a 100-page document well is still open.
- Asymmetric encoders: should queries and passages use different encoders or different prompts? Modern E5 uses different prompts; some research suggests separate encoders are better but more expensive.
- Distillation: how small can a sentence encoder get while preserving MTEB performance? MiniLM-L6 (22M params) is the current Pareto frontier for speed.
- Multilingual scaling: does training on 100 languages improve or hurt high-resource performance? BGE-m3 suggests it slightly helps with careful curation.