The Problem

Suppose you want to learn a similarity function: given two inputs (images, sentences, fingerprints), output a similarity score. You could train a single network that takes both inputs concatenated, but this has two limitations. You can’t pre-encode either input independently, so retrieval needs a full forward pass for every candidate pair. And the model has to learn each pairwise relationship from scratch instead of learning a general embedding space.

The Key Insight

Use two copies of the same network with shared weights. Pass each input through one copy independently, producing two embeddings. The similarity is then computed on the embeddings (cosine, Euclidean, or a small downstream head). Sharing weights means there’s really one network — the “twin” structure is logical, not physical. The network learns a single embedding function that simultaneously serves both inputs.

The architecture’s power: the embedding function is the same no matter which input it processes. A query and a document go through the same encoder, so their embeddings land in the same space. This is what makes the embeddings comparable.

Mechanism in Plain English

  1. Define a single encoder $f_\theta$: a CNN, RNN, transformer, whatever fits the modality.
  2. Take a pair $(x_1, x_2)$. Compute $u = f_\theta(x_1)$ and $v = f_\theta(x_2)$ independently.
  3. Compute the similarity / distance between $u$ and $v$ (cosine, Euclidean, or a learned head).
  4. Train with a loss that pulls similar pairs together (positive pairs) and pushes dissimilar pairs apart (negative pairs).
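
Steps 1 to 3 above can be sketched in plain Python. This is a toy illustration, not a real model: the linear `encode` function and the 2x3 matrix `THETA` are hypothetical stand-ins for a trained encoder and its weights.

```python
import math

def encode(x, weights):
    """Toy linear encoder: one weight matrix, used for every input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical weights standing in for a trained encoder (step 1).
THETA = [[1.0, 0.0, -1.0],
         [0.5, 2.0, 0.0]]

x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 2.0, 2.0]

u = encode(x1, THETA)  # step 2: the same weights ...
v = encode(x2, THETA)  # ... applied to the second input

# Step 3: compare the two embeddings.
print(cosine(u, v), euclidean(u, v))
```

Step 4 would then feed the similarity or distance into one of the losses described under Math with Translation.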

ASCII Diagram

INPUT 1                       INPUT 2
   |                              |
[ENCODER (theta)]            [ENCODER (theta)]    <- SAME WEIGHTS
   |                              |
   u                              v
       \                       /
        \                     /
       similarity / distance
                |
             loss (contrastive, triplet, MSE, etc.)

The ENCODER block is one network used twice. During backprop, gradients flow through both paths and accumulate on the shared weights.
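
To make "accumulate on the shared weights" concrete, here is a small hand-worked sketch. It assumes a deliberately trivial scalar encoder `encode(x, w) = w * x` and a squared-difference loss, both chosen only for illustration. The gradient with respect to the single shared weight is the sum of the contribution through `u` and the contribution through `v`, and a finite-difference check agrees with that sum.

```python
def encode(x, w):
    return w * x  # toy scalar encoder with one shared weight

def loss(w, x1, x2):
    u, v = encode(x1, w), encode(x2, w)
    return (u - v) ** 2

def grad_by_paths(w, x1, x2):
    u, v = encode(x1, w), encode(x2, w)
    grad_via_u = 2 * (u - v) * x1    # dL/du * du/dw
    grad_via_v = -2 * (u - v) * x2   # dL/dv * dv/dw
    return grad_via_u, grad_via_v

w, x1, x2 = 0.5, 3.0, 1.0
g_u, g_v = grad_by_paths(w, x1, x2)
total = g_u + g_v  # what ends up on the shared weight

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(w + eps, x1, x2) - loss(w - eps, x1, x2)) / (2 * eps)
print(g_u, g_v, total, numeric)
```

An autograd framework performs the same summation automatically when one module is called twice in a forward pass.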

Math with Translation

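Contrastive loss (Hadsell, Chopra, LeCun 2006):

  $$L = y \, D^2 + (1 - y) \, \max(0,\; m - D)^2$$

  • $D = \lVert u - v \rVert_2$: Euclidean distance between the two embeddings.
  • $y = 1$ if the pair is similar, $y = 0$ if dissimilar.
  • $m$ = margin, a hyperparameter.
  • For similar pairs: loss = $D^2$, drives distance to zero.
  • For dissimilar pairs: loss = $\max(0,\; m - D)^2$, drives distance to be at least $m$.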
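
A minimal sketch of this loss in plain Python, assuming the common form L = y·D² + (1 − y)·max(0, m − D)² with y = 1 for similar pairs (some presentations add factors of 1/2 or flip the label convention). The function name and the default margin of 1.0 are illustrative choices.

```python
def contrastive_loss(d, y, margin=1.0):
    """d: Euclidean distance between embeddings; y: 1 = similar, 0 = dissimilar."""
    pull = y * d ** 2                            # similar pairs: any distance costs
    push = (1 - y) * max(0.0, margin - d) ** 2   # dissimilar pairs: only inside the margin
    return pull + push

print(contrastive_loss(0.3, y=1))  # similar pair at distance 0.3
print(contrastive_loss(0.3, y=0))  # dissimilar pair inside the margin
print(contrastive_loss(1.5, y=0))  # dissimilar pair beyond the margin
```

Once a dissimilar pair is farther apart than the margin it contributes nothing, so training effort goes to the pairs that are still too close.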

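Triplet loss (Schroff et al., FaceNet, 2015):

  $$L = \max\big(0,\; \lVert f_\theta(a) - f_\theta(p) \rVert_2^2 - \lVert f_\theta(a) - f_\theta(n) \rVert_2^2 + \alpha\big)$$

  • $a$ = anchor, $p$ = positive, $n$ = negative, each embedded by the shared encoder $f_\theta$.
  • $\alpha$ = margin.
  • Drives the anchor to be closer to the positive than to the negative by at least $\alpha$.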
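
A small sketch of the triplet loss on already-computed embeddings, assuming squared Euclidean distances as in FaceNet. The helper names and the margin of 0.2 are illustrative.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.1, 0.0]

easy = triplet_loss(anchor, positive, negative=[1.0, 0.0])  # far negative
hard = triplet_loss(anchor, positive, negative=[0.3, 0.0])  # near negative
print(easy, hard)
```

The easy triplet already satisfies the margin and yields zero loss, while the hard one does not, which is why the choice of negatives matters so much in practice.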

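InfoNCE / Contrastive Cross-Entropy (modern default):

  $$L_i = -\log \frac{\exp(\mathrm{sim}(u_i, v_i) / \tau)}{\sum_{j} \exp(\mathrm{sim}(u_i, v_j) / \tau)}$$

  • $\mathrm{sim}$ = similarity between embeddings, typically cosine.
  • $\tau$ = temperature.
  • $v_i$ = the positive paired with $u_i$.
  • $v_j$ for $j \neq i$ = negatives (often in-batch).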
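
As a worked illustration, the sketch below evaluates InfoNCE on a hand-written similarity matrix in plain Python. It assumes in-batch negatives with the positives on the diagonal, so row i scores u_i against every v_j. The matrix values are invented for the example.

```python
import math

def info_nce(sim, tau):
    """sim[i][j]: similarity between u_i and v_j; positives sit on the diagonal."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        top = max(logits)  # subtract the max before exponentiating, for stability
        log_denom = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denom - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)

# Invented cosine similarities for a batch of three pairs.
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.0],
       [0.1, 0.4, 0.7]]

print(info_nce(sim, tau=1.0))   # soft: positives barely stand out
print(info_nce(sim, tau=0.05))  # sharp: positives dominate, loss is much lower
```

A smaller temperature amplifies differences between similarities, so the same embeddings produce a much lower loss here. With uninformative (all-equal) similarities, the loss equals the log of the batch size.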

Concrete Walkthrough

Sentence-BERT in siamese mode (NLI training):

INPUT PAIR:
  Sentence A: "A man is eating a sandwich."
  Sentence B: "A man is eating food."
  Label: entailment (label = 1, "similar")

SIAMESE ENCODE:
  u = SBERT_encoder(A)  -- 768-dim vector
  v = SBERT_encoder(B)  -- 768-dim vector
  (Same encoder weights, applied to both inputs.)

COMPUTE SIMILARITY:
  cosine(u, v) = 0.91

LOSS:
  One simple target: cosine ~= 1 for entailment pairs, ~ 0 for neutral, ~ -1 for contradiction.
  (The SBERT paper's own NLI objective is a softmax classifier over (u, v, |u - v|).)
  Backprop through both paths; gradients accumulate on the shared encoder.

AFTER TRAINING:
  Encode any new sentence X via u_X = SBERT_encoder(X).
  Compare any pair via cosine(u_X, u_Y).

Compare to a non-siamese setup (e.g., two BERTs with separate weights for query vs. document): the two encoders are free to diverge, so only query-document comparisons are trained to be meaningful. Query-query and doc-doc similarity come with no such guarantee.

What’s Clever

The clever move is weight sharing as architectural inductive bias. By forcing both encoders to be the same function, you guarantee that any two inputs X and Y are embedded in the same space. This is what makes nearest-neighbor lookup, deduplication, and all the downstream applications possible. Non-siamese architectures don’t get this property by construction: the “query encoder” and “document encoder” outputs share a space only to the extent that the training objective aligns them.
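
The nearest-neighbor use case can be sketched as follows. This is an illustration only: `encode` here just L2-normalizes hand-written feature vectors and stands in for a trained shared encoder, and the corpus names and values are invented.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def encode(x):
    # Placeholder for a trained shared encoder: the same function
    # is used for corpus items and for queries.
    return normalize(x)

corpus = {
    "doc_a": [1.0, 0.0, 0.2],
    "doc_b": [0.0, 1.0, 0.1],
    "doc_c": [0.9, 0.1, 0.3],
}

# Encode the corpus once, ahead of time.
index = {name: encode(x) for name, x in corpus.items()}

def nearest(query, index, k=2):
    q = encode(query)  # one encoder call per query, none per candidate
    scored = [(sum(a * b for a, b in zip(q, emb)), name)
              for name, emb in index.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

print(nearest([1.0, 0.0, 0.25], index))
```

Because queries and corpus items share one embedding space, the same index also supports item-to-item comparisons such as deduplication.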

The second clever recognition: siamese networks are differentiable and trainable end-to-end despite the structural duplication. The duplication is purely logical: there is one parameter set. In practice you can run both sides in a single forward pass (stack the two inputs along the batch dimension, encode, then split the embeddings back into pairs), compute the similarity loss, and run standard backprop. PyTorch and TF autograd sum the gradients from both paths automatically.

Code

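import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseSentenceEncoder(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        # One encoder, shared by both sides. Assumed to map a batch of
        # inputs to a (batch, dim) tensor of sentence embeddings.
        self.encoder = base_model

    def forward(self, sentences_a, sentences_b):
        u = self.encoder(sentences_a)  # (batch, dim)
        v = self.encoder(sentences_b)  # (batch, dim), same weights
        return u, v

def info_nce_loss(u, v, tau=0.05):
    # L2-normalize so the dot product below is cosine similarity.
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    sim = u @ v.T / tau          # (batch, batch); row i scores u_i against every v_j
    # The positive for u_i is v_i, so the target class for row i is i.
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(sim, labels)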

Key Sources

  • contrastive-learning — the standard training paradigm for siamese networks
  • bi-encoder — bi-encoder is the deployment-time view of a siamese network
  • sentence-embeddings — the typical output of a siamese sentence encoder
  • multimodal-embeddings — CLIP is a siamese-style network across modalities (different encoders per modality, shared output space)

Open Questions

  • When to break weight sharing? Some retrieval research argues for separate query and doc encoders (asymmetric SBERT). Trade-off: more parameters vs better modeling of asymmetry.
  • Multi-tower siamese: 3+ towers for triplet relationships (e.g., text-image-audio joint embedding). When does this help vs hurt?
  • Mining hard pairs: siamese training depends critically on negative selection. Hard negative mining is more art than science.
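  • Distillation: a siamese network already has a single parameter set, so the practical question is whether a smaller student encoder can learn what a larger siamese teacher knows. Generally yes; compact sentence encoders such as DistilBERT-based ones are often built this way.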