Bi-Encoder

The Problem

Cross-encoders (e.g., feeding [query] [SEP] [doc] through BERT) are great at scoring how well a query matches a document, but they are computationally infeasible for retrieval. To find the most relevant doc in a corpus of 1M, you’d need 1M BERT forward passes per query. At ~25ms per pass, that’s 25,000 seconds, or about 7 hours. Unusable. The problem: how to retain BERT’s representational power for similarity scoring while making the per-query cost sublinear in corpus size.

The Key Insight

Encode the query and document independently. Each becomes a single fixed-dim vector. Similarity is just cosine (or dot product). Document vectors are computed offline once and indexed; query encoding is one forward pass. The per-query cost becomes O(query encoding) + O(ANN lookup): the encoding cost is independent of corpus size, and the ANN lookup grows only sublinearly with it.

Mechanism in Plain English

Build two encoder towers (often sharing weights — “siamese”) based on a pretrained transformer.
Encode each input independently, pool to a single vector.
Train with a contrastive objective: positive pairs (query, relevant doc) should have high cosine similarity; negative pairs (query, irrelevant doc) should have low cosine similarity.
At inference time:
- Encode all documents once → store vectors in an index.
- Encode each new query once → look up nearest neighbors.
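
The contrastive step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the training code of any particular model: it assumes in-batch negatives (each query's positive is the document in the same row, and every other document in the batch acts as a negative) and a softmax cross-entropy over cosine similarities. The function names and the `temperature` value of 0.05 are assumptions chosen for the example, and the vectors are random stand-ins for encoder outputs.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def in_batch_contrastive_loss(q_vecs, d_vecs, temperature=0.05):
    """Cross-entropy over the query-document similarity matrix.

    Row i of q_vecs is paired with row i of d_vecs (the positive);
    every other row of d_vecs serves as a negative for query i.
    """
    q = l2_normalize(q_vecs)
    d = l2_normalize(d_vecs)
    logits = (q @ d.T) / temperature                      # (batch, batch) cosine / T
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives sit on the diagonal

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))                            # stand-ins for document vectors
aligned_queries = docs + 0.01 * rng.normal(size=(4, 8))   # queries close to their positives
shuffled_queries = np.roll(aligned_queries, 1, axis=0)    # queries paired with the wrong docs

loss_aligned = in_batch_contrastive_loss(aligned_queries, docs)
loss_shuffled = in_batch_contrastive_loss(shuffled_queries, docs)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each query sits nearest its own document and high when the pairing is wrong, which is the gradient signal that pulls positives together and pushes negatives apart during fine-tuning.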

ASCII Diagram

                       BI-ENCODER (Sentence-BERT, BGE)
QUERY  ----[encoder]---->  vec_q  ____
                                       \
                                        cosine ----> score
                                       /
DOC    ----[encoder]---->  vec_d  ____/
       (precomputed offline, indexed)


                       CROSS-ENCODER (BERT pair input)
QUERY [SEP] DOC ----[encoder]----> [score]
       (must run per pair, no offline indexing)

Math with Translation

For a bi-encoder with shared weights:

$\mathrm{score}(q, d) = \mathrm{cosine}(f_{\theta}(q), f_{\theta}(d))$

$f_{\theta}$ = the encoder (typically a transformer + pooling).
$q$ = query text, $d$ = document text.
The function factors across $q$ and $d$ — encode each independently, then cheap cosine.

Compare to cross-encoder:

$\mathrm{score}(q, d) = g_{\theta}([q; d])$

The function is non-factorizable: the encoder sees both inputs together.
This is why cross-encoders capture fine-grained interactions (token-level attention between query and doc) but can’t precompute.
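
To make the factorization concrete, here is a small NumPy sketch. The random arrays are stand-ins for $f_{θ}(d)$ and $f_{θ}(q)$ (no real encoder is involved), and the variable names are assumptions for illustration. It shows that once vectors are L2-normalized, one matrix-vector product gives the same scores as computing cosine pair by pair.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(5, 16))   # stand-ins for f_theta(d), computed offline
query_vec = rng.normal(size=16)       # stand-in for f_theta(q), computed per query

# Normalize once; afterwards cosine similarity is a plain dot product.
doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_unit = query_vec / np.linalg.norm(query_vec)

scores = doc_unit @ query_unit        # one matrix-vector product scores every document
pairwise = np.array([cosine(query_vec, d) for d in doc_vecs])  # slow reference version
best = int(np.argmax(scores))         # index of the highest-scoring document
```

A cross-encoder has no equivalent shortcut: because $g_{θ}$ needs the concatenated pair, nothing about a document can be cached before the query arrives.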

Concrete Walkthrough

Latency comparison for 1M-doc retrieval, batch size 1:

CROSS-ENCODER:
  Per query: 1M forward passes * 25ms = 25,000 seconds ≈ 7 hours.
  Memory:    O(1) per query.

BI-ENCODER:
  Setup: 1M forward passes (offline, one-time) -> 1M vectors stored in FAISS.
  Per query: 1 forward pass (~25ms) + ANN lookup (~5ms) = 30ms.
  Memory:    O(N) for the index. 1M docs * 1024d FP16 = ~2GB.

SPEEDUP: ~830,000x (online; 25,000 s / 0.03 s), with ~5-10 point accuracy drop on hard queries.
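
The arithmetic behind this comparison is easy to re-run with different assumptions. The sketch below uses only the figures quoted in the walkthrough (25 ms per forward pass, 5 ms per ANN lookup, 1M documents, 1024-dim FP16 vectors); the constant names are made up for the example.

```python
N_DOCS = 1_000_000
FORWARD_PASS_S = 0.025   # ~25 ms per encoder forward pass
ANN_LOOKUP_S = 0.005     # ~5 ms per nearest-neighbor lookup
DIM = 1024
BYTES_PER_VALUE = 2      # FP16

# Cross-encoder: one forward pass per (query, doc) pair, all at query time.
cross_encoder_s = N_DOCS * FORWARD_PASS_S
# Bi-encoder: one forward pass for the query plus one index lookup.
bi_encoder_s = FORWARD_PASS_S + ANN_LOOKUP_S
speedup = cross_encoder_s / bi_encoder_s
# Storage for the precomputed document vectors.
index_bytes = N_DOCS * DIM * BYTES_PER_VALUE

print(f"cross-encoder: {cross_encoder_s:,.0f} s ({cross_encoder_s / 3600:.1f} h)")
print(f"bi-encoder:    {bi_encoder_s * 1000:.0f} ms")
print(f"speedup:       {speedup:,.0f}x")
print(f"index size:    {index_bytes / 1e9:.2f} GB")
```

With these inputs the online speedup comes out at roughly 830,000x and the raw vectors take about 2 GB. Storing the same vectors as float32, as a flat FAISS index does, would double that.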

What’s Clever

The clever move is trading representation expressiveness for index-ability. Cross-encoders see both inputs jointly: the model can attend across query and document tokens, capturing fine-grained interactions. Bi-encoders can’t, because each input is encoded without seeing the other. This lossy compression of “what matters about doc D” into a fixed vector is the price of being able to precompute and index.

The second clever recognition: the training loss is what makes the bi-encoder work. A vanilla pretrained BERT used in bi-encoder mode performs poorly, because its embeddings aren’t structured for cosine. The contrastive fine-tuning step is what gives the embedding space the geometry that makes cosine similarity meaningful.

Code

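# Bi-encoder retrieval with sentence-transformers + FAISS
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en')

# OFFLINE: encode and index all documents
docs = ["doc 1 text", "doc 2 text", "doc 3 text"]  # ... up to 1M docs
# normalize_embeddings=True makes inner product equal to cosine similarity
doc_vecs = model.encode(docs, normalize_embeddings=True, batch_size=128)
doc_vecs = np.asarray(doc_vecs, dtype=np.float32)  # FAISS expects float32
# IndexFlatIP is an exact (brute-force) inner-product index; at 1M+ docs an
# approximate index (e.g. IVF or HNSW) is what makes the lookup sublinear
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # 1024 dims for bge-large-en
index.add(doc_vecs)

# ONLINE: encode query, search
query_vec = model.encode(["query text"], normalize_embeddings=True)
query_vec = np.asarray(query_vec, dtype=np.float32)
k = min(10, len(docs))  # asking for more hits than the index holds pads with -1
scores, indices = index.search(query_vec, k)
top_docs = [docs[i] for i in indices[0]]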

Key Sources

sentence-bert-siamese-bert-networks — the foundational bi-encoder paper
bge-c-pack-general-chinese-embeddings — modern bi-encoder SOTA
mteb-massive-text-embedding-benchmark — the canonical evaluation
colbert-late-interaction-retrieval — bi-encoder hybrid that keeps per-token vectors

Related Concepts

sentence-embeddings — bi-encoders produce sentence embeddings as their output
contrastive-learning — the standard training paradigm
late-interaction — the ColBERT alternative that adds back fine-grained interaction
semantic-similarity — the standard evaluation domain

Open Questions

Hybrid retrieval: how to combine bi-encoders with BM25 (lexical) or ColBERT (late-interaction) optimally? A common pattern uses a bi-encoder for first-stage retrieval and a cross-encoder or ColBERT for re-ranking.
Asymmetric encoders: should query and document encoders share weights? Most systems share; some research argues for separate encoders.
Vector compression: can bi-encoder vectors be compressed well below the 2 bytes per dim of FP16 (e.g., to 1-2 bits per dim) with minimal accuracy loss? Modern PQ indexes get close.
Long-context bi-encoders: how to extend beyond 512 tokens? BGE-M3 (8192 tokens) is one recent example of a longer-context model.
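
On the hybrid-retrieval question, one simple and widely used baseline is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions and so avoids having to calibrate BM25 scores against cosine scores. The sketch below is illustrative rather than a recommendation: the document ids and both rankings are hypothetical, and `k=60` is just the conventional default constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; each list is ordered best-first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)   # higher ranks contribute more
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

bm25_ranking = ["d3", "d1", "d7"]    # hypothetical lexical (BM25) results
dense_ranking = ["d1", "d5", "d3"]   # hypothetical bi-encoder results
fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused_ranking)
```

Documents that appear near the top of both lists (here `d1` and `d3`) rise above documents found by only one retriever. Whether this beats a learned combination is part of the open question.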
