The Problem

Word embeddings (Word2Vec, GloVe) gave us one vector per word. But most useful comparisons happen at the sentence or paragraph level: “is this query about the same topic as this passage?”, “is this customer review a complaint?”, “do these two product descriptions match?”. The naive fix, averaging the word vectors, discards word order and ignores syntactic structure (“the dog bit the man” averages to the same vector as “the man bit the dog”). In BERT-era systems you could feed two sentences together through a cross-encoder and read out a similarity score, but that costs one BERT forward pass per candidate, i.e. O(N) runs per query over a corpus of N items, which is impractical at scale. The problem: how to compute meaningful sentence-level vectors that work as standalone, comparable, indexable units.
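
A small sketch of why averaging is order-blind. The vocabulary and the random 4-dimensional vectors are toy stand-ins (assumptions, not real Word2Vec or GloVe weights), and `average_embedding` is a hypothetical helper. The point is only that a mean does not depend on token order.

```python
import numpy as np

# Toy embedding table: random vectors stand in for real Word2Vec/GloVe weights.
rng = np.random.default_rng(0)
vocab = ["the", "dog", "bit", "man"]
embeddings = {word: rng.normal(size=4) for word in vocab}

def average_embedding(sentence: str) -> np.ndarray:
    """Mean of the word vectors; token order plays no role in the result."""
    vectors = [embeddings[token] for token in sentence.split()]
    return np.mean(vectors, axis=0)

a = average_embedding("the dog bit the man")
b = average_embedding("the man bit the dog")

# Same multiset of words, so the two means coincide (up to float rounding).
same = bool(np.allclose(a, b))
print(same)
```

Any pooling that is a symmetric function of static word vectors has this property, which is why the rest of this section relies on a contextual encoder instead.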

The Key Insight

You can fine-tune a transformer encoder to produce sentence vectors directly — not as a byproduct of next-token prediction or masked-LM, but as the explicit goal — by training on pairs that should or shouldn’t be similar. This produces a bi-encoder: each sentence gets one fixed-dim vector, and similarity is just cosine. The architecture’s expressiveness comes from BERT-scale pretraining; the similarity geometry comes from contrastive fine-tuning.

Mechanism in Plain English

  1. Start with a pretrained transformer (BERT, RoBERTa, DistilBERT, or modern decoder-encoder hybrids).
  2. Add a pooling step: mean-pool the token outputs (preferred), take the [CLS] token (worse), or attention-pool (best for some tasks).
  3. Fine-tune on labeled or weakly-supervised pairs: training pulls the embeddings of similar pairs together and pushes the embeddings of dissimilar pairs apart. The standard loss is contrastive (InfoNCE) with in-batch negatives.
  4. At inference: encode each sentence once into a vector. Store in a vector index (FAISS, Annoy, Vespa). For any query, encode once and retrieve via approximate nearest neighbor.
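
The pooling step in the list above can be sketched as follows. This is a minimal NumPy illustration, assuming the encoder has already produced per-token outputs and that an attention mask marks padding. The function name `mean_pool` and the toy arrays are hypothetical, not taken from any particular library.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim) encoder outputs
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # guard against empty rows
    return summed / counts

# Toy input: 2 sentences, 3 token slots, 2 dims.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last slot is padding and must be ignored
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = mean_pool(tokens, mask)  # one fixed-size vector per sentence
print(pooled)
```

Masking matters because batches are padded to a common length; without it, padding vectors would leak into the sentence embedding.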

ASCII Diagram

SENTENCE A        SENTENCE B
    |                 |
[BERT/encoder]   [BERT/encoder]    (shared weights = "siamese")
    |                 |
[mean pool]       [mean pool]
    |                 |
   u (768d)         v (768d)
       \             /
        \           /
         cosine sim   <-- training target = entailment label / contrastive

Math with Translation

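The contrastive (InfoNCE) loss with in-batch negatives, for a batch of $B$ (query, positive) pairs, where $\mathrm{sim}$ is cosine similarity:

\[
\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(q_i, p_i^{+}) / \tau\right)}{\exp\left(\mathrm{sim}(q_i, p_i^{+}) / \tau\right) + \sum_{j \neq i} \exp\left(\mathrm{sim}(q_i, p_j) / \tau\right)}
\]

  • $q_i$ = encoder output for query $i$.
  • $p_i^{+}$ = encoder output for the positive (paired) document.
  • $p_j$ for $j \neq i$ = in-batch negatives (other documents in the same batch).
  • $\tau$ = temperature, typically 0.02-0.07.
  • The numerator scores the true positive; the denominator includes all in-batch comparisons.

The model learns to make $\mathrm{sim}(q_i, p_i^{+})$ much larger than $\mathrm{sim}(q_i, p_j)$ for $j \neq i$, i.e., to push positives close and negatives apart in the cosine geometry.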
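
The in-batch contrastive loss described above can be sketched in NumPy. This is an illustration under stated assumptions rather than a training-ready implementation: `info_nce_loss` is a hypothetical name, the temperature default of 0.05 is simply picked from the range quoted above, and random vectors stand in for encoder outputs.

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, passages: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive loss: passages[i] is the positive for queries[i];
    every other passage in the batch serves as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                       # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # positives sit on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # stand-in for 4 query embeddings

# Positives nearly identical to their queries: loss should be small.
aligned = info_nce_loss(q, q + 0.01 * rng.normal(size=(4, 8)))
# Positives rotated by one position, so every pair is mismatched: loss should be large.
shuffled = info_nce_loss(q, np.roll(q, 1, axis=0))
print(aligned, shuffled)
```

In real training the same computation runs inside an autodiff framework so that gradients flow back into the encoder.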

Concrete Walkthrough

TRAINING BATCH (size 4):
  q1 = "How tall is the Eiffel Tower?"           p1 = "The Eiffel Tower is 330m tall."
  q2 = "What is the capital of France?"           p2 = "Paris is the capital of France."
  q3 = "When was the Eiffel Tower built?"         p3 = "Construction began in 1887."
  q4 = "How many people visit annually?"          p4 = "About 7 million tourists visit per year."

ENCODE EACH:
  u1, u2, u3, u4   (queries, 4 vectors of 768d)
  v1, v2, v3, v4   (passages, 4 vectors of 768d)

DOT PRODUCT MATRIX (4x4):
        v1    v2    v3    v4
  u1  [0.81  0.42  0.65  0.31]    <- u1 should match v1 (highest)
  u2  [0.38  0.78  0.41  0.29]    <- u2 should match v2
  u3  [0.59  0.45  0.84  0.33]    <- u3 should match v3
  u4  [0.32  0.34  0.39  0.79]    <- u4 should match v4

LOSS: cross-entropy over each row -> gradient pulls diagonal up,
       pushes off-diagonal down.
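
The walkthrough's 4x4 matrix can be pushed through the row-wise cross-entropy directly. A minimal sketch: the matrix values are copied from the walkthrough, while the temperature values (0.05 and 1.0) and the helper name `row_cross_entropy` are assumptions chosen for illustration.

```python
import numpy as np

# Similarity matrix from the walkthrough: row i = query u_i, column j = passage v_j.
sim = np.array([
    [0.81, 0.42, 0.65, 0.31],
    [0.38, 0.78, 0.41, 0.29],
    [0.59, 0.45, 0.84, 0.33],
    [0.32, 0.34, 0.39, 0.79],
])

def row_cross_entropy(similarities: np.ndarray, temperature: float) -> float:
    """Softmax over each row, then negative log-probability of the diagonal entry."""
    logits = similarities / temperature
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

predicted = sim.argmax(axis=1)             # which passage each query ranks first
loss_sharp = row_cross_entropy(sim, 0.05)  # low temperature exaggerates the margins
loss_flat = row_cross_entropy(sim, 1.0)    # temperature 1 leaves the rows nearly uniform
print(predicted, loss_sharp, loss_flat)
```

Because every diagonal entry is already its row's maximum, dividing by a small temperature makes the softmax nearly one-hot and the loss small; at temperature 1 the same matrix yields a much larger loss.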

What’s Clever

The clever move is decoupling encoding from comparison. With a cross-encoder, encoding and comparison happen jointly, so every (query, doc) pair needs its own forward pass. With sentence embeddings, encoding happens once per sentence (offline), and comparison reduces to a vector dot product (online). This can give speedups on the order of 1000-10000x at a modest accuracy cost, which for many production retrieval systems is the right trade.
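
The offline/online split can be sketched with brute-force search. Random unit vectors stand in for precomputed document embeddings, and `normalize` and `top_k` are hypothetical helpers. A production system would likely swap the exhaustive dot product for an approximate nearest neighbor index such as those named earlier.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so that dot product equals cosine."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k documents with the highest cosine similarity."""
    scores = doc_matrix @ query_vec   # one matrix-vector product, no encoder call
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Offline: encode the corpus once (random vectors stand in for encoder outputs).
rng = np.random.default_rng(0)
doc_vecs = normalize(rng.normal(size=(1000, 64)))

# Online: encode the query once, then compare against every stored vector.
# The toy query is a noisy copy of document 42.
query = normalize(doc_vecs[42] + 0.05 * rng.normal(size=64))
indices, scores = top_k(query, doc_vecs, k=3)
print(indices, scores)
```

The per-query cost is a single encoder pass plus cheap vector arithmetic, independent of how expensive the encoder is per document.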

The second clever recognition: off-the-shelf BERT embeddings are bad for similarity. Pretrained BERT was optimized for masked language modeling, not for cosine geometry, so it has no incentive to produce well-spaced vectors. The fine-tuning step is what turns BERT into a sentence encoder. Skipping it can give you embeddings that score worse than averaged GloVe on semantic similarity tasks.

Code

# Using sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en')

queries = ["How tall is the Eiffel Tower?", "Capital of France?"]
docs = ["The Eiffel Tower is 330m tall.", "Paris is the capital."]

q_vecs = model.encode(queries, normalize_embeddings=True)  # (2, 1024)
d_vecs = model.encode(docs,    normalize_embeddings=True)  # (2, 1024)

# Cosine similarity matrix (since both are L2-normalized, dot = cosine)
sim = q_vecs @ d_vecs.T  # (2, 2)
print(sim)
# Illustrative output (exact values depend on the model and library version):
# [[0.82  0.31]
#  [0.34  0.78]]

Open Questions

  • Long-context: many sentence encoders cap at 512 tokens. How best to embed a 100-page document is still open.
  • Asymmetric encoders: should queries and passages use different encoders or different prompts? Modern E5 uses different prompts; some research suggests separate encoders are better but more expensive.
  • Distillation: how small can a sentence encoder get while preserving MTEB performance? MiniLM-L6 (22M params) is the current Pareto frontier for speed.
  • Multilingual scaling: does training on 100 languages improve or hurt high-resource performance? BGE-M3 suggests it slightly helps with careful curation.
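
For the asymmetric-prompt question above, the shared-encoder variant amounts to tagging each input with its role before encoding. A minimal sketch: the "query: " and "passage: " strings follow the convention documented for E5 models, while `with_role_prefix` is a hypothetical helper and the encoder call itself is left out.

```python
# Role prefixes following the convention documented for E5 models.
ROLE_PREFIXES = {"query": "query: ", "passage": "passage: "}

def with_role_prefix(texts: list[str], role: str) -> list[str]:
    """Prepend the role marker so a single shared encoder can tell queries from passages."""
    if role not in ROLE_PREFIXES:
        raise ValueError(f"unknown role: {role!r}")
    return [ROLE_PREFIXES[role] + text for text in texts]

prefixed_queries = with_role_prefix(["How tall is the Eiffel Tower?"], "query")
prefixed_passages = with_role_prefix(["The Eiffel Tower is 330m tall."], "passage")
print(prefixed_queries, prefixed_passages)
```

The prefixed strings would then be passed to the encoder in place of the raw text. Using separate encoders instead would double the model footprint, which is the cost side of the trade-off noted above.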