The Problem
Cross-encoders (e.g., feeding [query] [SEP] [doc] through BERT) are great at scoring how well a query matches a document, but they’re computationally infeasible for first-stage retrieval. To find the most relevant doc in a corpus of 1M, you’d need 1M BERT forward passes per query. At ~25ms per pass, that’s 25,000 seconds, or about 7 hours, per query. Unusable. The problem: how to retain BERT’s representational power for similarity scoring while making the per-query cost sublinear in corpus size.
The Key Insight
Encode the query and document independently. Each becomes a single fixed-dim vector. Similarity is just cosine (or dot product). Document vectors are computed offline once and indexed; query encoding is one forward pass; the per-query cost becomes O(query encoding) + O(ANN lookup). The encoding term is independent of corpus size, and the ANN lookup typically grows sublinearly with it.
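The offline/online split can be sketched with a toy example. Everything here is an illustrative assumption: `toy_encode` is a hypothetical bag-of-words stand-in for a real transformer encoder, and the three documents are made up. The point is only that document vectors are built once, and each query costs one encoding plus a matrix-vector product.

```python
import numpy as np

docs = [
    "how to train a neural network",
    "best pasta recipes for dinner",
    "neural network training tips",
]

# Toy vocabulary built from the corpus (a real encoder would use a tokenizer).
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d.split()}))}

def toy_encode(text):
    """Hypothetical stand-in for a transformer encoder: unit-norm bag of words."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# OFFLINE: one encoder call per document, done once and stored.
doc_matrix = np.stack([toy_encode(d) for d in docs])  # shape (N, dim)

def search(query, k=2):
    # ONLINE: one encoder call for the query, then a cheap matrix-vector product.
    q = toy_encode(query)
    scores = doc_matrix @ q  # cosine similarity, since all vectors are unit-norm
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

results = search("train neural network")
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

The brute-force product here is still linear in the number of documents; an ANN index replaces that step in a real system.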
Mechanism in Plain English
- Build two encoder towers (often sharing weights — “siamese”) based on a pretrained transformer.
- Encode each input independently, pool to a single vector.
- Train with a contrastive objective: positive pairs (query, relevant doc) should have high cosine similarity; negative pairs should have low.
- At inference time:
  - Encode all documents once → store vectors in an index.
  - Encode each new query once → look up nearest neighbors.
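The contrastive objective in the training step can be sketched as follows. The text does not say which contrastive loss is used, so this assumes one common choice: an InfoNCE-style loss with in-batch negatives, where row i of the query batch is paired with row i of the document batch and every other row serves as a negative. The function name and the temperature value of 0.05 are illustrative assumptions.

```python
import numpy as np

def in_batch_contrastive_loss(q_vecs, d_vecs, temperature=0.05):
    """InfoNCE with in-batch negatives: q_vecs[i] should match d_vecs[i] only."""
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    d = d_vecs / np.linalg.norm(d_vecs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                     # (B, B) scaled cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # cross-entropy on the diagonal

# Toy check with orthogonal vectors: matched pairs vs. deliberately mismatched pairs.
q = np.eye(4)
aligned_loss = in_batch_contrastive_loss(q, q)
shuffled_loss = in_batch_contrastive_loss(q, np.roll(q, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

The loss is near zero when each query is closest to its own document and large when the pairing is wrong, which is the pressure that shapes the embedding space during fine-tuning.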
ASCII Diagram
BI-ENCODER (Sentence-BERT, BGE)

  QUERY ----[encoder]----> vec_q ____
                                     \
                                      cosine ----> score
                                     /
  DOC   ----[encoder]----> vec_d ____/
        (precomputed offline, indexed)

CROSS-ENCODER (BERT pair input)

  QUERY [SEP] DOC ----[encoder]----> [score]
        (must run per pair, no offline indexing)
Math with Translation
For a bi-encoder with shared weights:
$$\mathrm{score}(q, d) = \cos\big(f_\theta(q),\, f_\theta(d)\big)$$
- $f_\theta$ = the encoder (typically a transformer + pooling).
- $q$ = query text, $d$ = document text.
- The function factors across $q$ and $d$: encode each independently, then cheap cosine.
Compare to cross-encoder:
$$\mathrm{score}(q, d) = g_\theta\big([q;\ \texttt{[SEP]};\ d]\big)$$
- The function is non-factorizable: the encoder sees both inputs together.
- This is why cross-encoders capture fine-grained interactions (token-level attention between query and doc) but can’t precompute.
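A short numerical check of the factorization described above: because each side is encoded and normalized on its own, scores for every query-document pair fall out of a single matrix product. The random arrays `Q` and `D` are hypothetical stand-ins for encoder outputs, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 16))  # stand-ins for 3 encoded queries
D = rng.normal(size=(5, 16))  # stand-ins for 5 encoded documents

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Factored scoring: normalize each side independently, then one matrix product.
all_scores = l2_normalize(Q) @ l2_normalize(D).T  # shape (3, 5)

# Reference: cosine similarity computed one pair at a time.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pairwise = np.array([[cosine(q, d) for d in D] for q in Q])
print(np.allclose(all_scores, pairwise))
```

No such shortcut exists for the cross-encoder form, since its score depends on the concatenated input and has to be recomputed for each pair.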
Concrete Walkthrough
Latency comparison for 1M-doc retrieval, batch size 1:
CROSS-ENCODER:
  Per query: 1M forward passes * 25ms = 25,000 seconds ≈ 7 hours.
  Memory: O(1) per query.
BI-ENCODER:
  Setup: 1M forward passes (offline, one-time) -> 1M vectors stored in FAISS.
  Per query: 1 forward pass (~25ms) + ANN lookup (~5ms) = 30ms.
  Memory: O(N) for the index. 1M docs * 1024d FP16 = ~2GB.
SPEEDUP: ~830,000x online (25,000 s / 30 ms), with ~5-10 point accuracy drop on hard queries.
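The arithmetic in this walkthrough can be reproduced in a few lines. The latency and dimension figures are the ones quoted in the text; the variable names are illustrative.

```python
n_docs = 1_000_000
forward_pass_s = 0.025  # ~25 ms per encoder forward pass (figure from the text)
ann_lookup_s = 0.005    # ~5 ms per ANN lookup (figure from the text)
dim = 1024
bytes_per_value = 2     # FP16

cross_encoder_s = n_docs * forward_pass_s      # one forward pass per document
bi_encoder_s = forward_pass_s + ann_lookup_s   # one forward pass plus one lookup
speedup = cross_encoder_s / bi_encoder_s
index_bytes = n_docs * dim * bytes_per_value

print(f"cross-encoder: {cross_encoder_s:,.0f} s per query ({cross_encoder_s / 3600:.1f} h)")
print(f"bi-encoder:    {bi_encoder_s * 1000:.0f} ms per query")
print(f"speedup:       {speedup:,.0f}x")
print(f"index size:    {index_bytes / 1e9:.2f} GB")
```

Under these figures the online speedup comes out a little above 800,000x, and the FP16 index at about 2 GB.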
What’s Clever
The clever move is trading representation expressiveness for index-ability. Cross-encoders see both inputs jointly — the model can attend across query and document tokens, capturing fine-grained interactions. Bi-encoders can’t. But this lossy compression of “what matters about doc D” into a fixed vector is the price of being able to precompute and index.
The second clever recognition: the loss is what makes the bi-encoder work. A vanilla pretrained BERT in bi-encoder mode is bad — its embeddings aren’t structured for cosine. The contrastive fine-tuning step is what gives the embedding space the geometry that makes cosine similarity meaningful.
Code
# Bi-encoder retrieval with sentence-transformers + FAISS
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer('BAAI/bge-large-en')

# OFFLINE: encode and index all documents
docs = ["doc 1 text", "doc 2 text", "doc 3 text"]  # placeholder; ~1M docs in practice
doc_vecs = model.encode(docs, normalize_embeddings=True, batch_size=128)  # float32, shape (N, 1024)

# Inner product equals cosine because the vectors are L2-normalized.
# IndexFlatIP is an exact brute-force index; at 1M+ docs an ANN index
# (e.g., faiss.IndexHNSWFlat or an IVF index) is the usual substitute.
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# ONLINE: encode query, search
query_vec = model.encode(["query text"], normalize_embeddings=True)
k = min(10, len(docs))  # FAISS pads results with -1 when k exceeds the index size
scores, indices = index.search(query_vec, k)
top_docs = [docs[i] for i in indices[0]]

Key Sources
- sentence-bert-siamese-bert-networks — the foundational bi-encoder paper
- bge-c-pack-general-chinese-embeddings — modern bi-encoder SOTA
- mteb-massive-text-embedding-benchmark — the canonical evaluation
- colbert-late-interaction-retrieval — bi-encoder hybrid that keeps per-token vectors
Related Concepts
- sentence-embeddings — bi-encoders produce sentence embeddings as their output
- contrastive-learning — the standard training paradigm
- late-interaction — the ColBERT alternative that adds back fine-grained interaction
- semantic-similarity — the standard evaluation domain
Open Questions
- Hybrid retrieval: how to combine bi-encoders with BM25 (lexical) or ColBERT (late-interaction) optimally? Modern systems use a bi-encoder for first-stage and cross-encoder/ColBERT for re-ranking.
- Asymmetric encoders: should query and document encoders share weights? Most systems share; some research argues for separate.
- Vector compression: can bi-encoder vectors be compressed to 1-2 bits per dim with minimal accuracy loss? Modern PQ indexes get close.
- Long-context bi-encoders: how to extend beyond 512 tokens? BGE-m3 (8192 tokens) is the current frontier.
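On the vector compression question, a minimal sketch of the simplest aggressive scheme may help make the trade-off concrete. This is sign-based binary quantization (1 bit per dimension) with Hamming-distance search, not product quantization, and the random vectors are synthetic stand-ins for real embeddings; how much accuracy survives on real data is exactly the open question.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 1024)).astype(np.float32)  # synthetic stand-ins

def binarize(vecs):
    """Keep only the sign of each dimension and pack 8 dimensions into one byte."""
    return np.packbits(vecs > 0, axis=1)

def hamming_search(query_code, codes, k=5):
    # XOR marks the differing bits; unpack and count them per document.
    diff = np.bitwise_xor(codes, query_code)
    dists = np.unpackbits(diff, axis=1).sum(axis=1)
    return np.argsort(dists, kind="stable")[:k], dists

codes = binarize(doc_vecs)  # shape (1000, 128), dtype uint8

# A query that is a lightly perturbed copy of document 42.
query = doc_vecs[42] + 0.1 * rng.normal(size=1024).astype(np.float32)
top, dists = hamming_search(binarize(query[None, :]), codes)

print(f"float32: {doc_vecs.nbytes:,} bytes, binary: {codes.nbytes:,} bytes")
print("nearest doc:", int(top[0]))
```

Each 1024-dimensional vector shrinks from 4,096 bytes in float32 (2,048 in FP16) to 128 bytes. A common pattern, assumed here rather than taken from the text, is to use the compact codes for candidate generation and rescore the shortlist with full-precision vectors.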