Definition
Late interaction is a retrieval architecture, popularized by the ColBERT model (Khattab and Zaharia, 2020), that sits between cheap single-vector search and expensive cross-encoder reranking. Instead of compressing a document into one embedding, ColBERT keeps a separate contextual embedding for every token. Queries are likewise represented as a bag of per-token vectors.
The MaxSim operator
Relevance is computed with a late-interaction operator called MaxSim. For each query-token embedding, the system finds its maximum cosine similarity against all document-token embeddings, then sums those per-token maxima into a final score. Because each query token is matched only to its single best document token, the score captures fine-grained, term-level alignment that a single pooled vector loses. Document embeddings do not depend on the query, so they can still be computed and stored offline.
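The scoring rule described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not ColBERT's actual implementation, which works on batched tensors. The function names (normalize, maxsim_score) and the toy 2-dimensional vectors are made up for this example.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine similarity with any document token."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        # Keep only the single best-matching document token for this query token.
        best = max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        total += best
    return total

# Toy "token embeddings" (hypothetical values, chosen only for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has an exact-direction match for both query tokens
doc_b = [[1.0, 1.0], [-1.0, 0.0]]             # only a partial match for each query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Here doc_a scores higher than doc_b because each query token finds a document token pointing in the same direction. Note that the score grows with the number of query tokens, so scores are comparable across documents for one query but not across queries of different lengths.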
Why "late"
The interaction between query and document happens late, only at the cheap MaxSim step, rather than early, inside a transformer that must process the pair jointly. A cross-encoder reruns the full model for every query-document pair; ColBERT encodes documents once, indexes their token vectors, and at query time encodes only the query and does lightweight similarity math. The cost is storage: the index holds one vector per document token instead of one per document, so index size is the main thing to budget for when self-hosting. Newer variants such as ColBERTv2 compress those token vectors to keep the index practical.
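To make the storage trade-off concrete, a back-of-the-envelope estimate can help. The sketch below counts raw, uncompressed vector bytes only. Every number in it (corpus size, tokens per document, dimensions, precision) is a hypothetical assumption chosen for illustration, not a measurement of any particular model or index, and the helper index_size_bytes is invented for this example.

```python
def index_size_bytes(num_docs, vectors_per_doc, dim, bytes_per_value):
    """Raw size of the stored vectors, ignoring index overhead and compression."""
    return num_docs * vectors_per_doc * dim * bytes_per_value

NUM_DOCS = 1_000_000  # hypothetical corpus size

# Assumed single-vector setup: one 768-dimensional float32 vector per document.
single_vector = index_size_bytes(NUM_DOCS, vectors_per_doc=1, dim=768, bytes_per_value=4)

# Assumed late-interaction setup: about 100 token vectors per document,
# 128 dimensions each, stored as float16.
multi_vector = index_size_bytes(NUM_DOCS, vectors_per_doc=100, dim=128, bytes_per_value=2)

ratio = multi_vector / single_vector

print(f"single-vector: {single_vector / 1e9:.2f} GB")
print(f"multi-vector:  {multi_vector / 1e9:.2f} GB")
print(f"ratio:         {ratio:.1f}x")
```

Under these assumptions the multi-vector index is several times larger even with smaller, lower-precision vectors, which is the gap that compression schemes aim to close.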
Late interaction is one point on a spectrum of accuracy-versus-cost trade-offs. Compare it with the heavier cross-encoder reranker at one extreme and lightweight single-vector dense retrieval at the other, then pick the tier your own hardware budget supports.
