The Problem
When are two pieces of text “the same”? Lexical match (Jaccard, BM25) catches direct overlap but misses paraphrase. Edit distance catches typos but misses synonyms. Hand-crafted features (WordNet synsets, named-entity overlap) catch some semantics but require linguistic engineering. The general problem: define a metric on text that respects meaning, not surface form.
The Key Insight
Reduce both texts to dense vectors in a learned space, then use cosine similarity (or any vector distance). The vectors are produced by a model trained so that semantically similar text maps to nearby vectors. The training signal can come from labeled pairs (NLI entailment, paraphrase corpora), from natural pairs (question-answer, click-through), or from self-supervised corruption (mask-and-reconstruct).
Mechanism in Plain English
- Represent text as dense vectors via an embedding model (sentence transformer, BERT pooling, etc.).
- Compute cosine similarity: $\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$ for embedding vectors $u$ and $v$.
- Threshold or rank as needed.
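The crucial detail: the embedding space must be trained for similarity. Off-the-shelf BERT vectors (mean-pooled or [CLS]) perform poorly at this; the SBERT paper reports that they often do worse than averaged GloVe embeddings. SBERT-style or BGE-style fine-tuning is what makes the cosine geometry meaningful.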
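The three steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the 3-dimensional vectors are hypothetical stand-ins for real model output, and the names `rank_by_similarity` and `THRESHOLD` (and the value 0.8) are assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings; a real model would produce hundreds of
# dimensions from the text itself.
query = [0.9, 0.1, 0.2]
candidates = {
    "paraphrase": [0.8, 0.2, 0.1],
    "unrelated": [0.1, 0.9, 0.3],
}

ranking = rank_by_similarity(query, candidates)

# Assumed cutoff for illustration; it has to be tuned per model and task.
THRESHOLD = 0.8
matches = [name for name, score in ranking if score >= THRESHOLD]

print(ranking)
print(matches)
```

With these toy vectors the paraphrase scores about 0.99 and the unrelated candidate about 0.27, so only the first clears the cutoff.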
ASCII Diagram
SEMANTIC SIMILARITY VS LEXICAL SIMILARITY:
T1 = "How tall is the Eiffel Tower?"
T2 = "What is the Eiffel Tower's height?"
T3 = "How tall is Mount Everest?"
LEXICAL (BM25):
sim(T1, T2) = 0.65 (shared "Eiffel Tower")
sim(T1, T3) = 0.55 (shared "How tall is")
Verdict: T2 wins, but by a small margin.
SEMANTIC (BGE):
sim(T1, T2) = 0.93 (asking same question)
sim(T1, T3) = 0.41 (different topic)
Verdict: T2 dominates clearly.
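The lexical side of this comparison is easy to reproduce. The sketch below uses Jaccard token overlap rather than BM25 (a simpler lexical measure, swapped in to keep the example self-contained); the tokenizer is an assumption and real systems typically add stemming and stop-word handling.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Size of the shared vocabulary divided by the combined vocabulary."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

T1 = "How tall is the Eiffel Tower?"
T2 = "What is the Eiffel Tower's height?"
T3 = "How tall is Mount Everest?"

sim_12 = jaccard(T1, T2)  # 4 shared tokens out of 9 -> 0.444...
sim_13 = jaccard(T1, T3)  # 3 shared tokens out of 8 -> 0.375

print(round(sim_12, 3), round(sim_13, 3))
```

As in the diagram, the paraphrase T2 wins, but only narrowly: word overlap cannot tell that T3 asks about a different subject.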
Math with Translation
The standard cosine similarity between two embedding vectors $u$ and $v$:
$$\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$
- Range: [-1, 1]; for L2-normalized vectors (the standard case): just $u \cdot v$.
- Cosine ignores magnitude; only direction matters. Two vectors pointing the same way have similarity 1; orthogonal vectors have 0.
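Most sentence-embedding models L2-normalize their output, so for unit-length embeddings $u$ and $v$ cosine simplifies to a dot product:
$$\cos(u, v) = u \cdot v \quad \text{when } \|u\| = \|v\| = 1$$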
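A quick numerical check of this simplification, as a sketch with made-up 2-dimensional vectors (the helper names `l2_normalize`, `dot`, and `cosine` are illustrative, not from any library):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Full cosine formula, dividing by both norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [3.0, 4.0]
v = [4.0, 3.0]

full = cosine(u, v)                            # 24 / (5 * 5) = 0.96
fast = dot(l2_normalize(u), l2_normalize(v))   # same value, no division at query time

# Rescaling a vector changes its magnitude but not its cosine.
scaled = cosine([10 * x for x in u], v)

print(full, fast, scaled)
```

Normalizing once at indexing time means each later comparison is a single dot product, which is what vector search libraries typically exploit.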
The training objective is what makes the cosine score semantically meaningful: typically InfoNCE with in-batch negatives.
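To make the objective concrete, here is a minimal sketch of InfoNCE with in-batch negatives in plain Python. It assumes L2-normalized inputs where row i of `positives` is the true match for row i of `queries`, and every other row in the batch acts as a negative. The function name `info_nce` and the temperature of 0.05 are assumptions for illustration; real training code would use an autodiff framework.

```python
import math

def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, positive) pairs.

    For query i, the softmax runs over its similarity to every row of
    `positives`; the loss is the negative log-probability of row i.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, p)) / temperature
                  for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query sits next to its positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query sits next to the wrong text

loss_aligned = info_nce(queries, aligned)
loss_swapped = info_nce(queries, swapped)

print(loss_aligned, loss_swapped)
```

The loss is near zero when matched pairs are the closest vectors in the batch and large when they are not, so minimizing it pulls paraphrases together and pushes unrelated text apart.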
Concrete Walkthrough
STS-Benchmark (the standard semantic similarity test): human-annotated similarity scores from 0 (unrelated) to 5 (identical in meaning) on about 8.6K English sentence pairs. A model outputs a cosine similarity for each pair; the metric is the Spearman correlation between model scores and human labels.
EXAMPLE PAIRS WITH HUMAN LABELS:
(5.0) "A man is playing guitar."
"A man plays guitar."
(3.4) "A boy is jumping into water."
"A child is splashing in the pool."
(0.5) "A black dog is running through grass."
"A man is cooking dinner."
A model that outputs cosine similarities of 0.94, 0.71, 0.18 respectively
would correlate well with these labels.
State-of-the-art models reach Spearman correlations of roughly 84-86 (reported as correlation × 100) on STS-Benchmark; human annotators agree with each other at about 85.
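The metric itself is just the Pearson correlation of the ranks. The sketch below computes it for the three example pairs above using only the standard library; the helper names are illustrative, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math

def ranks(values):
    """Rank values from 1 (smallest) upward, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

human = [5.0, 3.4, 0.5]     # labels from the example pairs
model = [0.94, 0.71, 0.18]  # the hypothetical model's cosine scores

rho = spearman(human, model)
print(rho)
```

Because Spearman only compares orderings, the model's scores do not need to live on the 0-5 scale. With just three pairs any monotone scoring gives a perfect correlation; the real benchmark averages over thousands of pairs, which is why scores in the mid-80s are hard to reach.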
What’s Clever
The clever recognition: semantic similarity is fundamentally a learned function, not a structural one. People spent decades trying to define semantic similarity by hand — WordNet path lengths, semantic role parsing, frame semantics. None of it worked at scale. The neural-network approach: define a similarity loss, train on enough pairs, let the model figure out the function. The resulting cosine similarity in embedding space is the semantic similarity for downstream applications.
The second clever recognition: most pretrained models embed badly for similarity. Pretrained BERT, GPT, and T5 all produce vectors, but their geometries are shaped by the pretraining objective (masked LM, next-token prediction, span corruption), not by cosine alignment of meaning-equivalent text. Fine-tuning for similarity is effectively mandatory, and SBERT-style contrastive fine-tuning is the canonical way to get there.
Key Sources
- sentence-bert-siamese-bert-networks: the foundational SBERT paper
- colbert-late-interaction-retrieval: late-interaction variant for fine-grained semantic similarity
- bge-c-pack-general-chinese-embeddings: modern SOTA on STS
- mteb-massive-text-embedding-benchmark: STS is one of MTEB’s 8 task categories
Related Concepts
- sentence-embeddings — the underlying representation
- contrastive-learning — the training paradigm
- bi-encoder — the standard architecture for similarity-by-cosine
- late-interaction — fine-grained alternative to single-vector cosine
Open Questions
- What is “similarity” really measuring? Different downstream tasks need different notions: paraphrase vs entailment vs topic vs intent. A single embedding can’t be best for all.
- Asymmetric similarity: “X is a dog” should be similar to “X is an animal” (entailment) but not vice versa. Cosine is symmetric. How to model directional similarity?
- Cross-domain transfer: models trained on STS-Benchmark-style data can fail on legal, medical, or code text. Domain-specific fine-tuning is needed but hard to get right.
- Calibration: cosine 0.7 in one model is “very similar”; in another it’s “barely related.” How to compare across models / set thresholds reliably?