Sentence Embeddings

The Problem

Word embeddings (Word2Vec, GloVe) gave us vectors per word. But most useful comparisons happen at the sentence or paragraph level: “is this query about the same topic as this passage?”, “is this customer review a complaint?”, “do these two product descriptions match?”. The naive fix, averaging the word vectors, destroys word order and ignores syntactic structure (“the dog bit the man” averages to the same thing as “the man bit the dog”). For BERT-era systems, you could feed each (query, candidate) pair through a cross-encoder and read out a similarity score, but that costs one BERT forward pass per candidate, so O(N) runs per query over a corpus of N sentences: useless at scale. The problem: how to compute meaningful sentence-level vectors that work as standalone, comparable, indexable units.
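The word-order problem with averaging can be checked by hand. The sketch below uses invented 3-dimensional vectors (the `toy_vectors` table and the `average_embedding` helper are hypothetical, not taken from Word2Vec or GloVe) and shows that both example sentences average to the same vector.

```python
import math

# Toy word vectors: the values are invented for illustration only.
toy_vectors = {
    "the": [0.1, 0.0, 0.2],
    "dog": [0.9, 0.3, 0.1],
    "bit": [0.2, 0.8, 0.5],
    "man": [0.4, 0.1, 0.7],
}

def average_embedding(sentence):
    """Average the word vectors of a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    # Transpose so each column holds one dimension across all words.
    columns = zip(*(toy_vectors[w] for w in words))
    # math.fsum is exactly rounded, so the result does not depend on word order.
    return [math.fsum(col) / len(words) for col in columns]

a = average_embedding("the dog bit the man")
b = average_embedding("the man bit the dog")
print(a == b)  # True: same bag of words, same mean
```

Any pooling that is a pure function of the bag of words has this property, which is why the encoder has to see the tokens in context before pooling.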

The Key Insight

You can fine-tune a transformer encoder to produce sentence vectors directly — not as a byproduct of next-token prediction or masked-LM, but as the explicit goal — by training on pairs that should or shouldn’t be similar. This produces a bi-encoder: each sentence gets one fixed-dim vector, and similarity is just cosine. The architecture’s expressiveness comes from BERT-scale pretraining; the similarity geometry comes from contrastive fine-tuning.

Mechanism in Plain English

Start with a pretrained transformer (BERT, RoBERTa, DistilBERT, or modern decoder-encoder hybrids).
Add a pooling step: mean-pool the token outputs (preferred) or take the [CLS] token (worse) or attention-pool (best for some tasks).
Fine-tune on labeled or weakly-supervised pairs: training pulls the embeddings of similar pairs together and pushes the embeddings of dissimilar pairs apart. The standard loss is contrastive (InfoNCE) with in-batch negatives.
At inference: encode each sentence once into a vector. Store in a vector index (FAISS, Annoy, Vespa). For any query, encode once and retrieve via approximate nearest neighbor.
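The pooling step above can be sketched in a few lines. This is a minimal, dependency-free illustration, assuming the encoder returns one vector per token position plus a 0/1 attention mask marking padding; the function name `mean_pool` and the sample numbers are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical encoder output: 4 token positions, 2 dims; the last position is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [3.0, 4.0]
```

Masking matters because padded positions would otherwise drag the mean toward whatever the encoder emits for padding, making a sentence's vector depend on the batch it was encoded in.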

ASCII Diagram

SENTENCE A        SENTENCE B
    |                 |
[BERT/encoder]   [BERT/encoder]    (shared weights = "siamese")
    |                 |
[mean pool]       [mean pool]
    |                 |
   u (768d)         v (768d)
       \             /
        \           /
         cosine sim   <-- training target = entailment label / contrastive

Math with Translation

The contrastive (InfoNCE) loss with in-batch negatives:

$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(u_{i} \cdot v_{i} / \tau)}{\sum_{j=1}^{N} \exp(u_{i} \cdot v_{j} / \tau)}$

$u_{i}$ = encoder output for query $i$.
$v_{i}$ = encoder output for the positive (paired) document.
$v_{j}$ for $j \neq i$ = in-batch negatives (other documents in the same batch).
$τ$ = temperature, typically 0.02-0.07.
The numerator scores the true positive; the denominator includes all in-batch comparisons.

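The model learns to make $u_{i} \cdot v_{i}$ much larger than $u_{i} \cdot v_{j}$ for $j \neq i$, i.e., to pull positives together and push negatives apart. When the embeddings are L2-normalized, the dot product equals cosine similarity, so the loss shapes the cosine geometry directly.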
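The loss can be written out directly from the formula. The following is a minimal sketch in plain Python, assuming the N x N matrix of dot products has already been computed; the function name `info_nce` and the two test matrices are illustrative, not from any library.

```python
import math

def info_nce(sim, tau):
    """Mean InfoNCE loss for an N x N similarity matrix.

    sim[i][j] holds u_i . v_j; the positive for row i sits on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]  # equals -log softmax(row)[i]
    return total / len(sim)

# If every pair scores the same, the model is guessing: loss = log(N).
uninformative = [[0.0] * 4 for _ in range(4)]
print(math.isclose(info_nce(uninformative, tau=0.05), math.log(4)))  # True

# A matrix with a dominant diagonal drives the loss toward zero.
well_separated = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(info_nce(well_separated, tau=0.05) < info_nce(uninformative, tau=0.05))  # True
```

The `log(N)` baseline is a useful sanity check during training: a loss stuck near it means the encoder is not separating positives from in-batch negatives at all.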

Concrete Walkthrough

TRAINING BATCH (size 4):
  q1 = "How tall is the Eiffel Tower?"           p1 = "The Eiffel Tower is 330m tall."
  q2 = "What is the capital of France?"           p2 = "Paris is the capital of France."
  q3 = "When was the Eiffel Tower built?"         p3 = "Construction began in 1887."
  q4 = "How many people visit annually?"          p4 = "About 7 million tourists visit per year."

ENCODE EACH:
  u1, u2, u3, u4   (queries, 4 vectors of 768d)
  v1, v2, v3, v4   (passages, 4 vectors of 768d)

DOT PRODUCT MATRIX (4x4):
        v1    v2    v3    v4
  u1  [0.81  0.42  0.65  0.31]    <- u1 should match v1 (highest)
  u2  [0.38  0.78  0.41  0.29]    <- u2 should match v2
  u3  [0.59  0.45  0.84  0.33]    <- u3 should match v3
  u4  [0.32  0.34  0.39  0.79]    <- u4 should match v4

LOSS: cross-entropy over each row -> gradient pulls diagonal up,
       pushes off-diagonal down.
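To see what "cross-entropy over each row" does with these numbers, the sketch below applies a row-wise softmax to the walkthrough matrix at two temperatures. The value 0.05 is an assumed choice inside the 0.02-0.07 range given earlier, and 1.0 is included only for contrast; the helper names are hypothetical.

```python
import math

# Dot-product matrix from the walkthrough (rows = queries, columns = passages).
sim = [
    [0.81, 0.42, 0.65, 0.31],
    [0.38, 0.78, 0.41, 0.29],
    [0.59, 0.45, 0.84, 0.33],
    [0.32, 0.34, 0.39, 0.79],
]

def row_softmax(row, tau):
    """Softmax over one row of similarities, scaled by temperature tau."""
    logits = [s / tau for s in row]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def batch_loss(sim, tau):
    """Mean negative log-probability assigned to the diagonal (true) pairs."""
    return -sum(math.log(row_softmax(row, tau)[i]) for i, row in enumerate(sim)) / len(sim)

for tau in (1.0, 0.05):
    diag_probs = [row_softmax(row, tau)[i] for i, row in enumerate(sim)]
    print("tau =", tau,
          "| P(correct passage) per row:", [round(p, 2) for p in diag_probs],
          "| loss:", round(batch_loss(sim, tau), 3))
```

Because cosine scores live in a narrow range, a temperature of 1.0 leaves the softmax nearly flat even when the diagonal is the largest entry in each row. Dividing by a small temperature stretches those gaps, so a correct ranking translates into a low loss.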

What’s Clever

The clever move is decoupling encoding from comparison. With a cross-encoder, encoding and comparison happen jointly — every (query, doc) pair needs its own forward pass. With sentence embeddings, encoding happens once per sentence (offline), and comparison reduces to a vector dot product (online). This gives 1000-10000x speedup at modest accuracy cost — for many production retrieval systems, the right trade.
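The offline/online split can be sketched as follows. This is a toy illustration, not a production index: the 3-dimensional vectors are placeholders standing in for real encoder output, the document IDs and the `search` helper are hypothetical, and the exhaustive scan stands in for an approximate nearest-neighbor index.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm so that dot product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline: each document is encoded once and its vector is stored.
doc_index = {
    "doc_eiffel_height": normalize([0.9, 0.1, 0.2]),
    "doc_paris_capital": normalize([0.1, 0.9, 0.3]),
    "doc_visitors": normalize([0.2, 0.3, 0.9]),
}

def search(query_vec, index, k=2):
    """Online: score one query vector against every stored vector, return top k."""
    q = normalize(query_vec)
    scored = [(dot(q, d), doc_id) for doc_id, d in index.items()]
    return sorted(scored, reverse=True)[:k]

# One encoder call would produce this query vector (placeholder values here).
results = search([0.8, 0.2, 0.1], doc_index, k=2)
print([doc_id for _, doc_id in results])
```

For a corpus of N documents, a query costs one encoder forward pass plus N dot products, whereas a cross-encoder would need N forward passes for the same query.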

The second clever recognition: off-the-shelf BERT embeddings are bad for similarity. Pretrained BERT was optimized for masked language modeling, not for cosine geometry, so it has no incentive to produce well-spaced sentence vectors. The fine-tuning step is what turns BERT into a sentence encoder. Skipping it tends to give embeddings that score worse than averaged GloVe on semantic-similarity benchmarks, as the Sentence-BERT paper reports.

Code

# Using sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
 
model = SentenceTransformer('BAAI/bge-large-en')
 
queries = ["How tall is the Eiffel Tower?", "Capital of France?"]
docs = ["The Eiffel Tower is 330m tall.", "Paris is the capital."]
 
q_vecs = model.encode(queries, normalize_embeddings=True)  # (2, 1024)
d_vecs = model.encode(docs,    normalize_embeddings=True)  # (2, 1024)
 
# Cosine similarity matrix (since both are L2-normalized, dot = cosine)
sim = q_vecs @ d_vecs.T  # (2, 2)
print(sim)
# Illustrative output (exact values depend on the model and its version):
# [[0.82  0.31]
#  [0.34  0.78]]

Key Sources

sentence-bert-siamese-bert-networks: the foundational paper
colbert-late-interaction-retrieval: keeps per-token vectors, recovers cross-encoder accuracy
bge-c-pack-general-chinese-embeddings: modern open-source SOTA
mteb-massive-text-embedding-benchmark: the canonical evaluation
word2vec-efficient-estimation-word-representations: the conceptual ancestor; per-word version

Related Concepts

contrastive-learning: the loss used to train sentence embeddings
bi-encoder: the architecture pattern
semantic-similarity: the canonical task
multimodal-embeddings: the cross-modal generalization (CLIP, SigLIP)
word-embeddings: sentence embeddings extend the idea from words

Open Questions

Long-context: most sentence encoders cap at 512 tokens. How to embed a 100-page document well is still open.
Asymmetric encoders: should queries and passages use different encoders or different prompts? Modern E5 uses different prompts; some research suggests separate encoders are better but more expensive.
Distillation: how small can a sentence encoder get while preserving MTEB performance? MiniLM-L6 (22M params) is the current Pareto frontier for speed.
Multilingual scaling: does training on 100 languages improve or hurt high-resource performance? BGE-M3 suggests it slightly helps with careful curation.
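The prompt-based form of asymmetry mentioned above amounts to a string prefix applied before encoding. The sketch below assumes the `query: ` and `passage: ` prefixes that the E5 model cards document; the helper names are hypothetical, and other model families expect different instructions, so check the model card before reusing these strings.

```python
# Prefixes as documented for E5-style models (an assumption for any other model).
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def prepare_queries(texts):
    """Mark texts as queries so a shared encoder can treat them differently."""
    return [QUERY_PREFIX + t for t in texts]

def prepare_passages(texts):
    """Mark texts as passages (the documents being searched)."""
    return [PASSAGE_PREFIX + t for t in texts]

queries = prepare_queries(["How tall is the Eiffel Tower?"])
passages = prepare_passages(["The Eiffel Tower is 330m tall."])
print(queries[0])   # query: How tall is the Eiffel Tower?
print(passages[0])  # passage: The Eiffel Tower is 330m tall.
```

This keeps a single set of encoder weights while still letting the model learn different behavior for the two roles, which is the cheaper of the two options raised in the question.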
