The Problem
Bi-encoders are fast but lossy: they compress every document into a single vector. Cross-encoders are accurate but slow: they re-process query+doc together for every pair. Both are extremes. Bi-encoders bottleneck information at the pooling step (one vector per sentence); cross-encoders bottleneck at compute (one BERT run per pair). The middle ground would be independent encoding (so we can precompute and index) but fine-grained per-token interaction (so we don’t lose query-doc semantic alignment).
The Key Insight
Encode each token of the query and each token of the document independently — small per-token vectors, indexable. At query time, compute a tiny interaction step using all token vectors: for each query token, find its best-matching document token (max similarity over all document tokens). Sum across query tokens. This is “late” interaction because the query-doc comparison happens after both have been encoded — nothing in the encoding step depends on the other input.
Mechanism in Plain English
- Encode query: tokens → vectors of dim (small, e.g., 128).
- Encode each document: tokens → vectors of dim . Index them all (offline).
- At query time: for each query token , compute its best match against all document tokens of a candidate document: .
- Sum these per-query-token scores: . This is the “MaxSim” operator.
- Use approximate nearest neighbor on the document tokens (treat the corpus as one big set of token vectors) to identify candidate documents efficiently.
ASCII Diagram
QUERY: "Eiffel Tower height"
q_Eiffel q_Tower q_height
| | |
[BERT encodes 3 vectors of 128d]
DOCUMENT: "The Eiffel Tower is 330 meters tall."
d_The d_Eiffel d_Tower d_is d_330 d_meters d_tall d_.
\ | | \ | | /
[BERT encodes 8 vectors of 128d, indexed offline]
SCORING (MAXSIM):
q_Eiffel: max over docs = 0.91 (matches d_Eiffel)
q_Tower: max over docs = 0.89 (matches d_Tower)
q_height: max over docs = 0.78 (matches d_meters or d_tall)
Score = 0.91 + 0.89 + 0.78 = 2.58
BI-ENCODER: ONE vector each, lossy.
CROSS-ENCODER: full attention, expensive.
LATE INTERACTION: many vectors, MaxSim aggregation. Both worlds.
Math with Translation
- = the -th query token’s encoded vector.
- = the -th document token’s encoded vector.
- Inner sum: per query token, the max similarity over all document tokens. This is a hard maximum — non-differentiable, but used at inference; training uses a soft variant or in-batch contrastive loss.
- Outer sum: aggregate the per-query-token contributions.
The crucial property: the score factors as , where the encoding of doesn’t depend on and vice versa. So ‘s vectors are pre-indexable.
Concrete Walkthrough
INDEX SIZE for ColBERT on 8.8M MS MARCO passages, ~100 tokens each:
~880M token vectors, 128d FP16 = ~225 GB.
After product quantization to 4-byte/vector: ~3.5 GB. Indexable.
QUERY-TIME COMPUTATION for a query of 8 tokens:
Step 1: Encode query (one BERT pass) -> 8 query vectors.
Step 2: For each q_i, fetch top-100 nearest doc tokens via FAISS.
Step 3: Aggregate by document: any doc that appears in ANY query token's
top-100 is a candidate (typically ~5K docs after deduplication).
Step 4: For each candidate doc, compute MaxSim against all 8 q_i.
Step 5: Rank candidates by sum.
Total: ~50ms. Comparable to bi-encoder, much better accuracy.
ACCURACY (MS MARCO MRR@10):
Bi-encoder: 31.4
Late interaction (ColBERT): 36.0
Cross-encoder (BERT): 36.5
What’s Clever
The clever move is recognizing that the bottleneck in bi-encoders is information aggregation, not encoding. The encoders are perfectly capable of producing rich token-level representations — they just aren’t allowed to use them, because the bi-encoder pools to a single vector. By keeping per-token vectors and pushing the aggregation to query time (with a cheap MaxSim), we recover most of the cross-encoder’s expressiveness without giving up offline indexability.
The second clever recognition: MaxSim is cheap enough to run on every candidate. Per query token + per doc token: one dot product. For a 10-token query and a 100-token doc: 1000 dot products. That’s nothing — vectorized, it’s a single matrix multiply. The expensive part isn’t MaxSim itself, it’s finding the candidate docs (handled by ANN over all token vectors).
The third clever move: dim reduction lets you index per-token. ColBERT projects BERT’s 768-dim output to 128-dim per token. Yes, each token’s vector is less informative than a full BERT vector — but you have many vectors per document, so the total information capacity is similar. And 128-dim is small enough that ANN over hundreds of millions of token vectors is tractable.
Key Sources
- colbert-late-interaction-retrieval — the foundational paper
- bge-c-pack-general-chinese-embeddings — BGE-m3 includes a multi-vector mode that’s effectively late interaction
Related Concepts
- bi-encoder — late interaction is a generalization; one vector → many vectors
- sentence-embeddings — late interaction produces per-token embeddings instead of per-sentence
- semantic-similarity — the standard evaluation
- contrastive-learning — late-interaction models train with contrastive losses similar to bi-encoders
Open Questions
- Index compression: ColBERTv2 introduced clustering for compression; can we go further? PLAID is the modern frontier.
- MaxSim alternatives: Why max, not weighted sum? Soft-MaxSim (logsumexp) is differentiable but less interpretable; results suggest hard max is robust enough.
- Query-time efficiency: late interaction is still ~5-10x slower than bi-encoder. How to bring this down without sacrificing accuracy?
- Multilingual late interaction: BGE-m3 supports it, but the index size is much larger. Trade-offs not yet well-characterized.