Semantic Similarity

The Problem

When are two pieces of text “the same”? Lexical match (Jaccard, BM25) catches direct overlap but misses paraphrase. Edit distance catches typos but misses synonyms. Hand-crafted features (WordNet synsets, named-entity overlap) catch some semantics but require linguistic engineering. The general problem: define a metric on text that respects meaning, not surface form.

The Key Insight

Reduce both texts to dense vectors in a learned space, then use cosine similarity (or any vector distance). The vectors are produced by a model trained so that semantically similar text maps to nearby vectors. The training signal can come from labeled pairs (NLI entailment, paraphrase corpora), from natural pairs (question-answer, click-through), or from self-supervised corruption (mask-and-reconstruct).

Mechanism in Plain English

Represent text as dense vectors via an embedding model (sentence transformer, BERT pooling, etc.).
Compute cosine similarity: $\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$.
Threshold or rank as needed.

The crucial detail: the embedding space must be trained for similarity. Off-the-shelf BERT vectors (mean-pooled or [CLS]) perform poorly at this. SBERT-style or BGE-style fine-tuning is what makes the cosine geometry meaningful.
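
The three steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the vectors are hypothetical 3-dimensional stand-ins for real model output, and the function name `rank_by_similarity` and the threshold of 0.5 are assumptions made for the example.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(u, v) = u . v / (||u|| ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_similarity(query, candidates, threshold=0.5):
    """Return (index, score) pairs with score >= threshold, best first."""
    scores = [cosine_similarity(query, c) for c in candidates]
    kept = [(i, s) for i, s in enumerate(scores) if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings, for illustration only (a real model
# would typically produce hundreds of dimensions).
query = np.array([1.0, 0.0, 0.0])
candidates = [
    np.array([0.0, 1.0, 0.0]),  # orthogonal to the query
    np.array([2.0, 0.0, 0.0]),  # same direction, different magnitude
    np.array([1.0, 1.0, 0.0]),  # 45 degrees away from the query
]

ranked = rank_by_similarity(query, candidates, threshold=0.5)
for index, score in ranked:
    print(index, round(score, 3))
```

The orthogonal candidate falls below the threshold and is dropped, while the candidate with twice the magnitude still scores highest because only direction counts.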

ASCII Diagram

        SEMANTIC SIMILARITY VS LEXICAL SIMILARITY:

  T1 = "How tall is the Eiffel Tower?"
  T2 = "What is the Eiffel Tower's height?"
  T3 = "How tall is Mount Everest?"

  LEXICAL (BM25):
    sim(T1, T2) = 0.65 (shared "Eiffel Tower")
    sim(T1, T3) = 0.55 (shared "How tall is")
    Verdict: T2 wins, but by a small margin.

  SEMANTIC (BGE):
    sim(T1, T2) = 0.93 (asking same question)
    sim(T1, T3) = 0.41 (different topic)
    Verdict: T2 dominates clearly.
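
The lexical half of the diagram can be reproduced with a small sketch. Note the substitution: it uses Jaccard overlap on lowercased word tokens rather than BM25, so the numbers differ from the diagram, but the pattern is the same (T2 beats T3 only narrowly). The tokenizer regex is an assumption chosen for simplicity.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and keep runs of letters as tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by size of the token union."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

t1 = "How tall is the Eiffel Tower?"
t2 = "What is the Eiffel Tower's height?"
t3 = "How tall is Mount Everest?"

sim_12 = jaccard(t1, t2)  # 4 shared tokens out of 9 distinct
sim_13 = jaccard(t1, t3)  # 3 shared tokens out of 8 distinct
print(round(sim_12, 3), round(sim_13, 3))
```

Surface overlap rewards the shared "How tall is" in T3 almost as much as the shared "Eiffel Tower" in T2, which is the weakness a learned embedding is meant to fix.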

Math with Translation

The standard cosine similarity:

$\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$

Range: $[-1, 1]$. For L2-normalized vectors (the standard case), both norms equal 1, so the cosine is just $u \cdot v$.
Cosine ignores magnitude; only direction matters. Two vectors pointing the same way have similarity 1, orthogonal vectors have 0, and opposite vectors have -1.

Most sentence-embedding models L2-normalize their output, so cosine simplifies to a dot product:

$\mathrm{score}(T_{1}, T_{2}) = f_{\theta}(T_{1}) \cdot f_{\theta}(T_{2})$
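
A short sketch of why normalization turns cosine into a dot product, and of the values cosine takes for same-direction, orthogonal, and opposite vectors. The 2-dimensional rows are hypothetical stand-ins for embedding-model output, and `l2_normalize` is a helper name introduced here.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row of x to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Hypothetical unnormalized embeddings, one row per text.
emb = np.array([
    [3.0, 4.0],
    [6.0, 8.0],    # same direction as row 0, twice the magnitude
    [-4.0, 3.0],   # orthogonal to row 0
    [-3.0, -4.0],  # opposite to row 0
])

unit = l2_normalize(emb)
scores = unit @ unit.T  # scores[i, j] is the cosine of rows i and j
print(np.round(scores[0], 3))
```

Once the rows are unit length, a single matrix product yields every pairwise cosine, which is why similarity search over normalized embeddings reduces to dot products.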

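Here $f_{\theta}$ is the embedding model with parameters $\theta$. The training objective is what makes $\mathrm{score}$ semantically meaningful: typically InfoNCE with in-batch negatives, where each text's paired example is the positive and the other examples in the batch act as negatives.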
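
A minimal NumPy sketch of InfoNCE with in-batch negatives, to make the objective concrete. It computes only the loss value (no gradients or training loop), the temperature of 0.05 is an assumed value, and the random vectors stand in for encoder outputs.

```python
import numpy as np

def info_nce(anchors: np.ndarray, positives: np.ndarray,
             temperature: float = 0.05) -> float:
    """Mean InfoNCE loss over a batch.

    Row i of `positives` is the positive for row i of `anchors`;
    every other row in the batch serves as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                      # scaled cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # correct pairs sit on the diagonal

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
aligned = anchors + 0.01 * rng.normal(size=(4, 8))  # positives close to their anchors
shuffled = aligned[[1, 2, 3, 0]]                    # positives paired with the wrong anchors

loss_aligned = info_nce(anchors, aligned)
loss_shuffled = info_nce(anchors, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is a softmax cross-entropy over each row of the similarity matrix, so it is low when every anchor scores its own positive above the rest of the batch and high when the pairing is wrong.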

Concrete Walkthrough

STS-Benchmark (the standard semantic similarity test): human-annotated similarity scores from 0 (unrelated) to 5 (semantically equivalent) on roughly 8.6K English sentence pairs. A model scores each pair by the cosine similarity of its two embeddings; the metric is the Spearman correlation between model scores and human labels.

EXAMPLE PAIRS WITH HUMAN LABELS:

(5.0) "A man is playing guitar."
       "A man plays guitar."

(3.4) "A boy is jumping into water."
       "A child is splashing in the pool."

(0.5) "A black dog is running through grass."
       "A man is cooking dinner."

A model that outputs cosine similarities of 0.94, 0.71, 0.18 respectively
would correlate well with these labels.

State-of-the-art models score Spearman ~84-86 (correlation × 100) on STS-Benchmark; human annotators agree with each other at ~85 on the same scale.
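
The metric can be checked by hand on the three example pairs above. The sketch below implements Spearman correlation as the Pearson correlation of ranks and assumes there are no tied values; the helper names `ranks` and `spearman` are introduced here for illustration.

```python
import numpy as np

def ranks(values: np.ndarray) -> np.ndarray:
    """Rank positions (1 = smallest); assumes no tied values."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(x, y) -> float:
    """Spearman correlation: Pearson correlation of the ranks (no ties assumed)."""
    rx = ranks(np.asarray(x, dtype=float))
    ry = ranks(np.asarray(y, dtype=float))
    return float(np.corrcoef(rx, ry)[0, 1])

human = [5.0, 3.4, 0.5]     # human labels from the example pairs
model = [0.94, 0.71, 0.18]  # model cosine similarities from the example

rho = spearman(human, model)
print(rho)
```

Because Spearman compares only rank order, the model is not penalized for using a 0-to-1 scale while humans use 0-to-5. On real data with ties, a library routine such as `scipy.stats.spearmanr` is the safer choice.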

What’s Clever

The clever recognition: semantic similarity is fundamentally a learned function, not a structural one. People spent decades trying to define semantic similarity by hand — WordNet path lengths, semantic role parsing, frame semantics. None of it worked at scale. The neural-network approach: define a similarity loss, train on enough pairs, let the model figure out the function. The resulting cosine similarity in embedding space is the semantic similarity for downstream applications.

The second clever recognition: most pretrained models embed badly for similarity. Pretrained BERT, GPT, and T5 all produce vectors, but their geometries are optimized for masked LM, next-token prediction, or span corruption, not for cosine alignment of meaning-equivalent text. Fine-tuning is mandatory. SBERT-style contrastive fine-tuning is the canonical way to get there.

Key Sources

sentence-bert-siamese-bert-networks — the foundational SBERT paper
colbert-late-interaction-retrieval — late-interaction variant for fine-grained semantic similarity
bge-c-pack-general-chinese-embeddings — modern SOTA on STS
mteb-massive-text-embedding-benchmark — STS is one of MTEB’s 8 task categories
t2vec-deep-representation-learning-trajectory-similarity

Related Concepts

sentence-embeddings — the underlying representation
contrastive-learning — the training paradigm
bi-encoder — the standard architecture for similarity-by-cosine
late-interaction — fine-grained alternative to single-vector cosine

Open Questions

What is “similarity” really measuring? Different downstream tasks need different notions: paraphrase vs entailment vs topic vs intent. A single embedding can’t be best for all.
Asymmetric similarity: “X is a dog” should be similar to “X is an animal” (entailment) but not vice versa. Cosine is symmetric. How to model directional similarity?
Cross-domain transfer: models trained on STS-Benchmark can fail on legal, medical, or code text. Domain-specific fine-tuning is usually needed but is hard to get right.
Calibration: cosine 0.7 in one model is “very similar”; in another it’s “barely related.” How to compare across models / set thresholds reliably?
