The Problem

Cross-encoders (e.g., feeding [query] [SEP] [doc] through BERT) are great at scoring how well a query matches a document, but they’re computationally infeasible for first-stage retrieval. To find the most relevant doc in a corpus of 1M, you’d need 1M BERT forward passes per query. At ~25ms per pass, that’s 25,000 seconds, or roughly 7 hours. Unusable. The problem: how to retain BERT’s representational power for similarity scoring while making the per-query cost sublinear in corpus size.

The Key Insight

Encode the query and document independently. Each becomes a single fixed-dim vector. Similarity is just cosine (or dot product, which gives the same value once the vectors are L2-normalized). Document vectors are computed offline once and indexed; query encoding is one forward pass; the per-query cost becomes O(query encoding) + O(ANN lookup). The encoding cost is independent of corpus size, and the ANN lookup typically grows sublinearly with it.
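
To make the "cosine (or dot product)" point concrete, here is a minimal numpy sketch. The vectors are random stand-ins for encoder outputs, and the helper names are illustrative assumptions, not part of any library. It shows that once vectors are scaled to unit length, a plain dot product reproduces cosine similarity, so one matrix-vector product can score every document.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis (floor avoids dividing by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vec_q = rng.normal(size=8)          # stand-in for an encoded query
doc_vecs = rng.normal(size=(5, 8))  # stand-ins for 5 encoded documents

# Offline: normalize documents once. Online: normalize the query once.
doc_unit = l2_normalize(doc_vecs)
q_unit = l2_normalize(vec_q)

# One matrix-vector product scores every document at once.
dot_scores = doc_unit @ q_unit
# Reference: cosine computed pair by pair on the unnormalized vectors.
cos_scores = np.array([cosine(vec_q, d) for d in doc_vecs])

print(np.allclose(dot_scores, cos_scores))
```

This is why retrieval indexes can work with inner product alone when embeddings are normalized at encoding time.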

Mechanism in Plain English

  1. Build two encoder towers (often sharing weights — “siamese”) based on a pretrained transformer.
  2. Encode each input independently, pool to a single vector.
  3. Train with a contrastive objective: positive pairs (query, relevant doc) should have high cosine similarity; negative pairs should have low.
  4. At inference time:
    • Encode all documents once → store vectors in an index.
    • Encode each new query once → look up nearest neighbors.
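
Step 3 does not name a specific loss, so the following is only one common way to write a contrastive objective: an InfoNCE-style loss with in-batch negatives. The temperature value, the function name, and the random stand-in vectors are assumptions chosen for illustration. The sketch is forward-only in numpy; a real training loop would use an autodiff framework.

```python
import numpy as np

def in_batch_contrastive_loss(q_vecs, d_vecs, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives.

    Row i of q_vecs is paired with row i of d_vecs (the positive);
    every other document in the batch acts as a negative for query i.
    """
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    d = d_vecs / np.linalg.norm(d_vecs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy, target = diagonal

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 16))
aligned_queries = docs + 0.01 * rng.normal(size=(4, 16))  # each query sits near its own doc
random_queries = rng.normal(size=(4, 16))                 # queries unrelated to the docs

loss_aligned = in_batch_contrastive_loss(aligned_queries, docs)
loss_random = in_batch_contrastive_loss(random_queries, docs)
print(loss_aligned, loss_random)
```

The loss is small when each query is closest to its own document and larger when the pairing carries no signal, which is the pressure that shapes the embedding space during fine-tuning.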

ASCII Diagram

                       BI-ENCODER (Sentence-BERT, BGE)
QUERY  ----[encoder]---->  vec_q  ____
                                       \
                                        cosine ----> score
                                       /
DOC    ----[encoder]---->  vec_d  ____/
       (precomputed offline, indexed)


                       CROSS-ENCODER (BERT pair input)
QUERY [SEP] DOC ----[encoder]----> [score]
       (must run per pair, no offline indexing)

Math with Translation

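For a bi-encoder with shared weights:

  $$ s(q, d) = \cos\big(f(q),\, f(d)\big) = \frac{f(q) \cdot f(d)}{\lVert f(q) \rVert \, \lVert f(d) \rVert} $$

  • $f$ = the encoder (typically a transformer + pooling).
  • $q$ = query text, $d$ = document text.
  • The function factors across $q$ and $d$: encode each independently, then cheap cosine.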
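
The "transformer + pooling" encoder mentioned above has to collapse a sequence of token vectors into one vector. The sketch below shows mean pooling over non-padding tokens, which is one common choice (taking the first-token vector is another); which pooling a given model uses is not stated here, so treat this as an assumption. The token embeddings are random stand-ins for real transformer output.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real (non-padding) positions.

    token_embeddings: (batch, seq_len, dim) transformer output (random stand-in here).
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1.0)                       # (batch, 1), avoids /0
    return summed / counts

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 3))
attention_mask = np.array([[1, 1, 1, 0],    # first sequence: 3 real tokens, 1 pad
                           [1, 1, 0, 0]])   # second sequence: 2 real tokens, 2 pads
pooled = mean_pool(token_embeddings, attention_mask)
print(pooled.shape)
```

Masking matters because padded positions would otherwise drag the average toward whatever the model emits for padding tokens.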

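Compare to cross-encoder:

  $$ s(q, d) = g\big([q;\ \mathrm{SEP};\ d]\big) $$

  • $g$ = a single encoder that reads the query $q$ and document $d$ as one concatenated input and outputs a score.
  • The function is non-factorizable: the encoder sees both inputs together.
  • This is why cross-encoders capture fine-grained interactions (token-level attention between query and doc) but can’t precompute.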
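
The practical consequence of factorizing versus not can be seen by counting model calls. In this toy sketch both "models" are hypothetical stand-ins (a bag-of-characters vector and a word-overlap score), chosen only so the code runs without a real network; the scores themselves are not meaningful, only the call counts are.

```python
import numpy as np

calls = {"bi": 0, "cross": 0}

def bi_encode(text: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for the bi-encoder: one call per text, sees one input only."""
    calls["bi"] += 1
    vec = np.zeros(dim)
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-12)

def cross_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for the cross-encoder: one call per (query, doc) pair."""
    calls["cross"] += 1
    return float(len(set(query.split()) & set(doc.split())))

queries = ["what is a bi-encoder", "how does faiss work", "what is bert"]
docs = ["a bi-encoder embeds text", "faiss indexes vectors",
        "bert is a transformer", "cosine compares vectors"]

# Factorized: encode each text once, then score all pairs with one matrix product.
Q = np.stack([bi_encode(q) for q in queries])
D = np.stack([bi_encode(d) for d in docs])
bi_scores = Q @ D.T                                   # (3, 4)

# Non-factorized: every pair needs its own model call.
cross_scores = np.array([[cross_score(q, d) for d in docs] for q in queries])

print(calls)  # bi grows as |Q| + |D|, cross grows as |Q| * |D|
```

In a real system the document side of the bi-encoder work happens offline, so only the query encodings remain at serving time.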

Concrete Walkthrough

Latency comparison for 1M-doc retrieval, batch size 1:

CROSS-ENCODER:
  Per query: 1M forward passes * 25ms = 25,000 seconds = 7 hours.
  Memory:    O(1) per query.

BI-ENCODER:
  Setup: 1M forward passes (offline, one-time) -> 1M vectors stored in FAISS.
  Per query: 1 forward pass (~25ms) + ANN lookup (~5ms) = 30ms.
  Memory:    O(N) for the index. 1M docs * 1024d FP16 = ~2GB.

SPEEDUP: ~830,000x (online; 25,000 s / 0.03 s), with ~5-10 point accuracy drop on hard queries.
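
The figures in this walkthrough can be rechecked with a few lines of arithmetic. The inputs below are the numbers quoted in the walkthrough (25ms per forward pass, 5ms per ANN lookup, 1024 dimensions at 2 bytes each); the variable names are just illustrative.

```python
N_DOCS = 1_000_000
FORWARD_PASS_S = 0.025   # ~25 ms per encoder forward pass
ANN_LOOKUP_S = 0.005     # ~5 ms per ANN lookup
DIM = 1024
BYTES_PER_FP16 = 2

cross_encoder_s = N_DOCS * FORWARD_PASS_S        # one forward pass per document
bi_encoder_s = FORWARD_PASS_S + ANN_LOOKUP_S     # one query encoding + one lookup
speedup = cross_encoder_s / bi_encoder_s
index_gb = N_DOCS * DIM * BYTES_PER_FP16 / 1e9   # raw vector storage, no index overhead

print(f"cross-encoder per query: {cross_encoder_s:,.0f} s ({cross_encoder_s / 3600:.1f} h)")
print(f"bi-encoder per query:    {bi_encoder_s * 1000:.0f} ms")
print(f"online speedup:          {speedup:,.0f}x")
print(f"index size (FP16):       {index_gb:.2f} GB")
```

Changing any single assumption (for example, batching the cross-encoder on a GPU) shifts the ratio, so the speedup is best read as an order-of-magnitude estimate.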

What’s Clever

The clever move is trading representational expressiveness for indexability. Cross-encoders see both inputs jointly: the model can attend across query and document tokens, capturing fine-grained interactions. Bi-encoders can’t. That lossy compression of “what matters about doc D” into a fixed vector is the price of being able to precompute and index.

The second clever recognition: the training objective is what makes the bi-encoder work. A vanilla pretrained BERT in bi-encoder mode performs poorly because its embeddings aren’t structured for cosine comparison. The contrastive fine-tuning step is what gives the embedding space the geometry that makes cosine similarity meaningful.

Code

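# Bi-encoder retrieval with sentence-transformers + FAISS
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en')

# OFFLINE: encode and index all documents
docs = ["doc 1 text", "doc 2 text", "doc 3 text"]  # ... up to 1M docs
doc_vecs = model.encode(docs, normalize_embeddings=True, batch_size=128)
doc_vecs = np.asarray(doc_vecs, dtype=np.float32)  # FAISS expects float32
# IndexFlatIP is an exact (brute-force) inner-product index; inner product
# equals cosine here because the vectors are normalized. For sublinear lookup
# at 1M+ docs, swap in an approximate index (e.g., HNSW or IVF).
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # 1024 dims for this model
index.add(doc_vecs)

# ONLINE: encode query, search
query_vec = model.encode(["query text"], normalize_embeddings=True)
query_vec = np.asarray(query_vec, dtype=np.float32)
k = min(10, len(docs))  # FAISS returns -1 ids when k exceeds the index size
scores, indices = index.search(query_vec, k)
top_docs = [docs[i] for i in indices[0]]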
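
For intuition about what a flat inner-product index such as IndexFlatIP computes, here is a numpy-only sketch of exact top-k search. It is a conceptual simplification, not FAISS's implementation, and the function name and random vectors are assumptions for illustration.

```python
import numpy as np

def top_k_inner_product(doc_vecs: np.ndarray, query_vec: np.ndarray, k: int):
    """Exact top-k by inner product: score every document, keep the k best."""
    scores = doc_vecs @ query_vec                 # (N,) one score per document
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]     # the k best, in arbitrary order
    top = top[np.argsort(-scores[top])]           # sort only those k, best first
    return scores[top], top

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 64)).astype(np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# Build a query that is a slightly noisy copy of document 42.
query_vec = doc_vecs[42] + 0.05 * rng.normal(size=64).astype(np.float32)
query_vec /= np.linalg.norm(query_vec)

scores, indices = top_k_inner_product(doc_vecs, query_vec, k=10)
print(indices[0], scores[0])
```

This costs O(N) per query, which is exactly the work an approximate index avoids by searching only a small part of the corpus.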

Open Questions

  • Hybrid retrieval: how to combine bi-encoders with BM25 (lexical) or ColBERT (late-interaction) optimally? Many modern systems use a bi-encoder for first-stage retrieval and a cross-encoder or ColBERT for re-ranking.
  • Asymmetric encoders: should query and document encoders share weights? Most systems share; some research argues for separate.
  • Vector compression: can bi-encoder vectors be compressed well below the 2 bytes per dim of FP16 (e.g., to 1-2 bits per dim) with minimal accuracy loss? Modern PQ indexes get close.
  • Long-context bi-encoders: how to extend beyond 512 tokens? BGE-M3 (up to 8192 tokens) is one recent example.
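
On the hybrid retrieval question, one simple and widely used baseline for merging a lexical ranking with a bi-encoder ranking is reciprocal rank fusion (RRF). It is not proposed in the text above; it is shown here only as an illustrative starting point. The document ids and both rankings are hypothetical, and k=60 is the constant commonly used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]    # hypothetical lexical (BM25) results
dense_ranking = ["d1", "d4", "d3", "d9"]   # hypothetical bi-encoder results

fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused_ranking)
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales; whether it is optimal is the open question.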