The Problem

Bi-encoders are fast but lossy: they compress every document into a single vector. Cross-encoders are accurate but slow: they re-process query+doc together for every pair. Both are extremes. Bi-encoders bottleneck information at the pooling step (one vector per sentence); cross-encoders bottleneck at compute (one BERT run per pair). The middle ground would be independent encoding (so we can precompute and index) but fine-grained per-token interaction (so we don’t lose query-doc semantic alignment).

The Key Insight

Encode each token of the query and each token of the document independently — small per-token vectors, indexable. At query time, compute a tiny interaction step using all token vectors: for each query token, find its best-matching document token (max similarity over all document tokens). Sum across query tokens. This is “late” interaction because the query-doc comparison happens after both have been encoded — nothing in the encoding step depends on the other input.

Mechanism in Plain English

  1. Encode query: tokens → vectors of dim (small, e.g., 128).
  2. Encode each document: tokens → vectors of dim . Index them all (offline).
  3. At query time: for each query token , compute its best match against all document tokens of a candidate document: .
  4. Sum these per-query-token scores: . This is the “MaxSim” operator.
  5. Use approximate nearest neighbor on the document tokens (treat the corpus as one big set of token vectors) to identify candidate documents efficiently.

ASCII Diagram

QUERY: "Eiffel Tower height"
       q_Eiffel  q_Tower  q_height
          |        |         |
        [BERT encodes 3 vectors of 128d]


DOCUMENT: "The Eiffel Tower is 330 meters tall."
          d_The  d_Eiffel  d_Tower  d_is  d_330  d_meters  d_tall  d_.
              \      |        |         \   |       |        /
               [BERT encodes 8 vectors of 128d, indexed offline]


SCORING (MAXSIM):
  q_Eiffel: max over docs = 0.91 (matches d_Eiffel)
  q_Tower:  max over docs = 0.89 (matches d_Tower)
  q_height: max over docs = 0.78 (matches d_meters or d_tall)

  Score = 0.91 + 0.89 + 0.78 = 2.58


           BI-ENCODER:        ONE vector each, lossy.
           CROSS-ENCODER:     full attention, expensive.
           LATE INTERACTION:  many vectors, MaxSim aggregation. Both worlds.

Math with Translation

  • = the -th query token’s encoded vector.
  • = the -th document token’s encoded vector.
  • Inner sum: per query token, the max similarity over all document tokens. This is a hard maximum — non-differentiable, but used at inference; training uses a soft variant or in-batch contrastive loss.
  • Outer sum: aggregate the per-query-token contributions.

The crucial property: the score factors as , where the encoding of doesn’t depend on and vice versa. So ‘s vectors are pre-indexable.

Concrete Walkthrough

INDEX SIZE for ColBERT on 8.8M MS MARCO passages, ~100 tokens each:
  ~880M token vectors, 128d FP16 = ~225 GB.
  After product quantization to 4-byte/vector: ~3.5 GB. Indexable.

QUERY-TIME COMPUTATION for a query of 8 tokens:
  Step 1: Encode query (one BERT pass) -> 8 query vectors.
  Step 2: For each q_i, fetch top-100 nearest doc tokens via FAISS.
  Step 3: Aggregate by document: any doc that appears in ANY query token's
          top-100 is a candidate (typically ~5K docs after deduplication).
  Step 4: For each candidate doc, compute MaxSim against all 8 q_i.
  Step 5: Rank candidates by sum.

  Total: ~50ms. Comparable to bi-encoder, much better accuracy.

ACCURACY (MS MARCO MRR@10):
  Bi-encoder:    31.4
  Late interaction (ColBERT):  36.0
  Cross-encoder (BERT):         36.5

What’s Clever

The clever move is recognizing that the bottleneck in bi-encoders is information aggregation, not encoding. The encoders are perfectly capable of producing rich token-level representations — they just aren’t allowed to use them, because the bi-encoder pools to a single vector. By keeping per-token vectors and pushing the aggregation to query time (with a cheap MaxSim), we recover most of the cross-encoder’s expressiveness without giving up offline indexability.

The second clever recognition: MaxSim is cheap enough to run on every candidate. Per query token + per doc token: one dot product. For a 10-token query and a 100-token doc: 1000 dot products. That’s nothing — vectorized, it’s a single matrix multiply. The expensive part isn’t MaxSim itself, it’s finding the candidate docs (handled by ANN over all token vectors).

The third clever move: dim reduction lets you index per-token. ColBERT projects BERT’s 768-dim output to 128-dim per token. Yes, each token’s vector is less informative than a full BERT vector — but you have many vectors per document, so the total information capacity is similar. And 128-dim is small enough that ANN over hundreds of millions of token vectors is tractable.

Key Sources

Open Questions

  • Index compression: ColBERTv2 introduced clustering for compression; can we go further? PLAID is the modern frontier.
  • MaxSim alternatives: Why max, not weighted sum? Soft-MaxSim (logsumexp) is differentiable but less interpretable; results suggest hard max is robust enough.
  • Query-time efficiency: late interaction is still ~5-10x slower than bi-encoder. How to bring this down without sacrificing accuracy?
  • Multilingual late interaction: BGE-m3 supports it, but the index size is much larger. Trade-offs not yet well-characterized.