[v6] Add support for MultiVectorEncoder models #3794
Draft
tomaarsen wants to merge 7 commits into
NohTow reviewed Jun 17, 2026
Comment on lines +108 to +109

    try:
        from transformers import PaliGemmaProcessor
NohTow reviewed Jun 17, 2026

        # We can't set widget examples from an IterableDataset without losing data
        continue

    if dataset[dataset_name].format["type"] == "custom":
Hello!
Pull Request overview
- `MultiVectorEncoder`, a first-class model family for ColBERT-style / late-interaction retrieval
- `SimilarityFunction.MAXSIM` and `maxsim` / `maxsim_pairwise` utilities

Details
This brings PyLate's feature set into `sentence-transformers` as a first-class `MultiVectorEncoder` family sitting alongside `SentenceTransformer`, `SparseEncoder`, and `CrossEncoder`. Models produce a sequence of token-level vectors per input, while scoring uses MaxSim late-interaction. The naming mirrors `SparseEncoder`'s output-shape framing ("late interaction" is one scoring strategy on top of multi-vector outputs, not the encoder itself).

The package at `sentence_transformers/multi_vector_encoder/` ships with `MultiVectorTransformer` (`query_length` / `document_length`, PAD -> MASK query expansion, `attend_to_expansion_tokens`), `MultiVectorMask` (skiplist), `HierarchicalPooling` (Ward clustering for storage compression), four losses (`MultiVectorMultipleNegativesRankingLoss`, `CachedMultiVectorMultipleNegativesRankingLoss`, `MultiVectorDistillKLDivLoss`, `MultiVectorMarginMSELoss`), and five evaluators (`MultiVector{InformationRetrieval,NanoBEIR,Triplet,Distillation,Reranking}Evaluator`). ColBERT MaxSim scoring lives in `util/similarity.py` so `model.similarity` dispatch and evaluators can request `"MaxSim"` by name, while XTR scoring lives in `multi_vector_encoder/scoring/xtr.py` and slots into any of the four losses as `score_metric=XTRScores()`. There's also `KDProcessing` for join-at-iteration-time KD data, re-exported from `sentence_transformers.util` since dense distillation flows benefit too.

Hub interop is symmetric across the load paths: native ST saves load the obvious way, while PyLate v3 checkpoints (`model_type == "ColBERT"`) are auto-promoted via `_apply_legacy_fixups`. Stanford-NLP checkpoints (`architectures == ["HF_ColBERT"]`) are detected and load the inline `linear.weight` + `artifact.metadata` via `_load_default_modules`. SentenceTransformer checkpoints with a final `Dense` head can be converted into multi-vector models with the projection weights preserved. I've tested e.g. `lightonai/GTE-ModernColBERT-v1`, `colbert-ir/colbertv2.0`, and `answerdotai/answerai-colbert-small-v1`, and all round-trip within bf16 noise of PyLate.

#3614 was a prior community attempt at the same problem and is used as a reference for the inference subset only. This PR inherits from `BaseModel` rather than `SentenceTransformer` (the post-v5.4 pattern) and avoids the `LateInteractionPooling` module that conflated projection with masking.

Usage
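For readers unfamiliar with MaxSim, the late-interaction scoring referred to in the description can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code in this PR: the helper names `dot` and `maxsim_sketch` and the toy vectors are hypothetical, padding and skiplist masks are ignored, and a real implementation would presumably use batched tensor operations instead of lists.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_sketch(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Hypothetical helper: MaxSim score for a single query/document pair.

    For every query token vector, take the highest dot product against any
    document token vector, then sum those maxima over the query tokens.
    """
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings, made up for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a close match for each query token
doc_b = [[0.5, 0.5]]                          # a single, less specific token

score_a = maxsim_sketch(query, doc_a)
score_b = maxsim_sketch(query, doc_b)
```

Because each query token only contributes its best match, `doc_a` scores higher than `doc_b` here even though it contains extra tokens that match nothing well.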
This work is still very much a work-in-progress. I'd like for the implementation to be sufficiently flexible that it can incorporate all forms of multi-vector models, from text-only (ColBERT) to text+image (ColPali) and much beyond. The idea is to implement as much as possible in standalone modules rather than the core MultiVectorEncoder class, so that future models with different architectural choices (e.g. different query expansion, skiplist, pooling, etc.) can be trained, evaluated, and loaded as expected.
There are still open questions in that regard, e.g. currently I'm working with a MultiVectorTransformer subclass of Transformer, but perhaps I'd like to absorb all of those architectural choices into the core class and have the "Transformer" part be a more modular option. I'm also interested in trainable scoring mechanisms, but that also requires more architectural flexibility than currently exists.
I've also uploaded these models for testing:
cc @NohTow