fix: route MIEB/MAEB task-types to correct LoRA adapter for jina v5 omni by florian-hoenicke · Pull Request #4656 · embeddings-benchmark/mteb

florian-hoenicke · 2026-05-11T21:48:47Z

Summary

jinaai/jina-embeddings-v5-omni-{nano,small} ship four LoRA adapters (retrieval / clustering / text-matching / classification) and switch between them via the task= arg passed to self.model.encode(...) inside JinaV5OmniWrapper.encode. The initial omni registration in #4604 copied the text wrapper's model_prompts dict, which only had text-task-type keys (Retrieval, Clustering, STS, …). Every MIEB / MAEB task therefore fell through to a hard-coded task="retrieval", encoding clustering, VisualSTS, audio-clustering and audio-pair tasks with the wrong adapter.

Fix

Three changes in this PR:

1. Adapter routing via simplified-task fallback + per-type overrides. JinaV5OmniWrapper.encode now resolves the adapter in two steps: (a) per-MTEB-type override from loader_kwargs.model_prompts, (b) otherwise fall back to task_metadata.simplified_task_type mapped via:
   ```
   _SIMPLIFIED_TO_JINA_TASK = {
       "retrieval": "retrieval",
       "clustering": "clustering",
       "classification": "classification",
       "semantic-similarity": "text-matching",
       "pair-classification": "text-matching",
   }
   ```
   Each omni ModelMeta keeps only the 10 overrides where empirical routing diverges from the simplified default. New MTEB task types route automatically.
2. Drop retrieval-only "Query: " / "Document: " prefix for non-retrieval adapters. Only the retrieval LoRA was trained with the prefix; injecting it on the other variants collapsed scores (VisualSTS(eng) 0.47→0.83, AROCocoOrder 0.37→0.52, NMSQAPair 0.75→0.83 on small).
3. Drop the nano torch_dtype=float32 override. Forced fp32 on nano (small ran bf16) broke OCR/document tasks (HatefulMemesT2I 0.06→0.77, VidoreArxivQA 0.19→0.75).
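
The two-step resolution described in the first change can be sketched roughly as follows. This is a minimal illustration, not the wrapper's actual code: the helper name `resolve_jina_task` and the `EXAMPLE_OVERRIDES` dict are hypothetical, and only two of the overrides from the routing table are shown.

```python
from typing import Mapping

# Mapping copied from the PR description.
_SIMPLIFIED_TO_JINA_TASK = {
    "retrieval": "retrieval",
    "clustering": "clustering",
    "classification": "classification",
    "semantic-similarity": "text-matching",
    "pair-classification": "text-matching",
}

# Hypothetical subset of the per-type overrides (two of the ten).
EXAMPLE_OVERRIDES = {
    "ImageClassification": "retrieval",
    "Compositionality": "clustering",
}


def resolve_jina_task(
    task_type: str,
    simplified_task_type: str,
    overrides: Mapping[str, str],
) -> str:
    """Hypothetical helper: per-type override first, simplified fallback second."""
    if task_type in overrides:
        return overrides[task_type]
    return _SIMPLIFIED_TO_JINA_TASK[simplified_task_type]


# An overridden type ignores its simplified default (pair-classification).
print(resolve_jina_task("Compositionality", "pair-classification", EXAMPLE_OVERRIDES))
# A type without an override falls back to the simplified mapping.
print(resolve_jina_task("AudioClustering", "clustering", EXAMPLE_OVERRIDES))
```

Checking the override dict first means a new MTEB task type needs no wrapper change as long as its simplified type is one of the five keys.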

Final per-type routing (after empirical validation: 145 tasks × 4 variants × 2 models)

| LoRA adapter | MTEB task types routed here | Source |
|---|---|---|
| `retrieval` | `Any2AnyRetrieval`, `Any2AnyMultilingualRetrieval`, `AudioRetrieval`, `DocumentUnderstanding`, `VisionCentricQA`, `AudioReranking` | simplified-fallback (`retrieval`) |
| `retrieval` | `ImageClassification`, `ZeroShotClassification`, `AudioClassification`, `AudioZeroshotClassification`, `AudioMultilabelClassification`, `VideoClassification`, `VideoZeroshotClassification` | override (classification LoRA empty for non-text modalities) |
| `retrieval` | `AudioPairClassification` | override (+0.06-0.08 vs text-matching on NMSQAPair / CREMADPair / VoxPopuliAccentPair) |
| `clustering` | `AudioClustering` | simplified-fallback (`clustering`) |
| `clustering` | `Compositionality` | override (+0.01 to +0.27 vs text-matching on AROFlickrOrder / AROCocoOrder / SugarCrepe / Winoground / ImageCoDe; AROVisualRelation / AROVisualAttribution insensitive) |
| `text-matching` | `VisualSTS(eng)`, `VisualSTS(multi)` | simplified-fallback (`semantic-similarity`) |
| `text-matching` | `ImageClustering` | override (best on 8 of 10 small+nano combinations across CIFAR10/100, TinyImageNet, ImageNet10/Dog15 Clustering; aggregate mean 0.771 vs retrieval 0.763 vs clustering 0.756) |
| `classification` | (none; every `*Classification` MIEB/MAEB type routes to retrieval) | n/a |

Tests

tests/test_models/test_jina_v5_omni_wrapper.py (19 tests):

- prefix gating: retrieval keeps "Query: " / "Document: ", other variants get prompt=""
- nano loader_kwargs no longer pins torch_dtype=float32
- simplified-fallback paths (Any2AnyRetrieval, AudioClustering, VisualSTS(eng), AudioPairClassification)
- override paths (Image/Audio/Zeroshot/VideoClassification, Compositionality, AudioPairClassification, ImageClustering)
- drift guard: _SIMPLIFIED_TO_JINA_TASK keys must equal mteb's SimplifiedTaskType set
- minimality guard: every override must differ from the simplified default

🤖 Generated with Claude Code

The omni models ship four LoRA adapters (retrieval / clustering / text-matching / classification) and switch via the `task=` arg passed to `self.model.encode(...)` in JinaV5OmniWrapper. The arg is resolved by `get_prompt_name(model_prompts, task_metadata, prompt_type)`, which only looks up by `task_name`, `task_type`, or `prompt_type` in `model_prompts`. Our previous dict only had text-task-type keys (Retrieval, Clustering, …), so every MIEB image and MAEB audio task fell through to `jina_task_name = None` and the wrapper hardcoded `task = "retrieval"`. As a result every image/audio task was encoded with the retrieval adapter; clustering and VisualSTS scores were not produced with the variant the model is trained for.

This commit adds the MIEB and MAEB task-type keys so the resolver maps each to the right LoRA adapter:

- retrieval <- Any2AnyRetrieval, Any2AnyMultilingualRetrieval, DocumentUnderstanding, ZeroShotClassification, ImageClassification, VisionCentricQA, AudioRetrieval, AudioClassification, AudioZeroshotClassification, AudioMultilabelClassification, AudioReranking
- clustering <- ImageClustering, Compositionality, AudioClustering
- text-matching <- VisualSTS(eng), VisualSTS(multi), AudioPairClassification

The assignment matches Jina's internal frontier-dashboard variant routing (harness `VARIANT_TASK_TO_MTEB_TYPES` / `VARIANT_TASK_TO_DATASETS["maeb"]`).

florian-hoenicke · 2026-05-11T21:51:56Z

Task-type → variant assignment rationale

For traceability, here is the complete mapping we use, which mirrors Jina's internal frontier-dashboard routing (harness/src/core/evaluator.py: VARIANT_TASK_TO_MTEB_TYPES and VARIANT_TASK_TO_DATASETS["maeb"]):

Image (MIEB lite / eng / Multilingual)

| MTEB task type | LoRA adapter | Example tasks |
|---|---|---|
| `Any2AnyRetrieval` | `retrieval` | MSCOCOT2IRetrieval, Flickr30kT2I, BLINK, OVEN, METI2IRetrieval, GLDv2*, Imagenet1kZeroShot, … |
| `Any2AnyMultilingualRetrieval` | `retrieval` | XM3600T2IRetrieval, XFlickr30kCoT2IRetrieval, WITT2IRetrieval |
| `DocumentUnderstanding` | `retrieval` | Vidore* (DocVQA, InfoVQA, TabFQuAD, …) |
| `ZeroShotClassification` | `retrieval` | Caltech101ZeroShot, MNISTZeroShot, RESISC45ZeroShot, … |
| `ImageClassification` | `retrieval` | Caltech101, DTD, Food101Classification, Imagenet1k, … |
| `VisionCentricQA` | `retrieval` | CVBenchCount, CVBenchDepth, CVBenchDistance, CVBenchRelation |
| `ImageClustering` | `clustering` | CIFAR10Clustering, CIFAR100Clustering, TinyImageNetClustering, ImageNet10Clustering, ImageNetDog15Clustering |
| `Compositionality` | `clustering` | AROCocoOrder, AROFlickrOrder, AROVisualAttribution, AROVisualRelation, SugarCrepe, Winoground, ImageCoDe |
| `VisualSTS(eng)` | `text-matching` | STS12-16VisualSTS, VisualSTS17Eng, VisualSTS-b-Eng |
| `VisualSTS(multi)` | `text-matching` | VisualSTS17Multilingual, VisualSTS-b-Multilingual |

Audio (MAEB beta)

| MTEB task type | LoRA adapter | Example tasks |
|---|---|---|
| `AudioRetrieval` | `retrieval` | ClothoT2A, FleursT2A, GigaSpeechT2A, SpokenSQuADT2A, JamAlt*, MACST2A, UrbanSound8KT2A, … |
| `AudioClassification` | `retrieval` | BeijingOpera, BirdCLEF, CREMA_D, FSD2019Kaggle, GTZANGenre, IEMOCAPGender, MridinghamTonic, VoxCelebSA |
| `AudioZeroshotClassification` | `retrieval` | RavdessZeroshot, SpeechCommandsZeroshotv0.02, CommonLanguageAgeDetection, VoxPopuliLanguageID |
| `AudioReranking` | `retrieval` | GTZANAudioReranking |
| `AudioMultilabelClassification` | `retrieval` | SIBFLEURS |
| `AudioClustering` | `clustering` | CREMA_DClustering, VehicleSoundClustering, VoxPopuliGenderClustering |
| `AudioPairClassification` | `text-matching` | CREMADPairClassification, NMSQAPairClassification, VoxPopuliAccentPairClassification |

Why no `classification` adapter

Every task whose MTEB type contains "Classification" (image or audio, zeroshot or supervised) is routed to the retrieval adapter. That matches our internal training/eval pipeline where the classification LoRA's task list is empty and classification eval is performed via the retrieval adapter's contrastive embeddings.

Verification

After this patch, the wrapper logs `Using prompt_name=ImageClustering for task=CIFAR10Clustering prompt_type=...` (or the equivalent for VisualSTS / AudioClustering / AudioPairClassification) instead of `No combination of task name and prompt type was found`. We're now re-running affected tasks (12 image clustering + 9 image VisualSTS + 3 audio clustering + 3 audio pair-classification = 27 tasks, × 2 models = 54 task runs) and will update embeddings-benchmark/results with the corrected JSONs once they land.

🤖 Generated with Claude Code

Samoed

I think it would be better to use task.metadata.simplified_task_type in the wrapper.

…na v5 omni

Two divergences vs the variant-trained models, both observed when comparing our MTEB output against the model author's harness frontier scores:

1. JinaV5OmniWrapper unconditionally prepended "Query: " / "Document: " to every encode call. Only the retrieval LoRA adapter was trained with that prefix; clustering / text-matching / classification variants were trained without it. Injecting it collapsed scores on VisualSTS(eng), AROCocoOrder, NMSQAPairClassification, etc. The wrapper now only adds the prefix when `task == "retrieval"`, matching the upstream training-time convention (instructions={"query":"","document":""} for non-retrieval variants).
2. The nano ModelMeta hard-coded `torch_dtype=torch.float32`, while small ran in bf16. The forced upcast on nano caused large divergence on OCR / document retrieval tasks (HatefulMemesT2IRetrieval 0.06 vs 0.77, Vidore arxiv 0.19 vs 0.75). Dropping the override makes nano load with the same dtype handling as small.

Adds tests/test_models/test_jina_v5_omni_wrapper.py with bidirectional verification (5 tests fail on the unfixed wrapper, all 7 pass with the fix).

florian-hoenicke · 2026-05-13T08:33:39Z

Added two follow-up fixes after comparing our MTEB outputs against the model author's evaluation harness:

1. Retrieval-only prompt prefix. JinaV5OmniWrapper.encode was unconditionally prepending "Query: " / "Document: " for every task type. Only the retrieval LoRA adapter was trained with that prefix; clustering, text-matching, and classification variants were trained without it. The wrapper now only adds the prefix when task == "retrieval". This matches the upstream training-time convention (instructions={"query":"","document":""} for non-retrieval variants). Observed effect on small without the gating: VisualSTS(eng) 0.47 vs 0.83, AROCocoOrder 0.37 vs 0.52, NMSQAPairClassification 0.75 vs 0.83.

2. nano torch_dtype=float32 override. The nano ModelMeta was forcing fp32, while small ran in bf16. The forced upcast caused large divergence on OCR / document retrieval tasks: HatefulMemesT2IRetrieval 0.06 vs 0.77, VidoreArxivQARetrieval 0.19 vs 0.75. Dropping the override makes nano load with the same dtype handling as small.

tests/test_models/test_jina_v5_omni_wrapper.py covers both: stubs the underlying SentenceTransformer to capture the task and prompt kwargs the wrapper passes, and asserts (a) retrieval keeps "Query: "/"Document: ", (b) clustering / text-matching / classification get prompt="", (c) nano loader_kwargs no longer pins float32. Verified bidirectional: 5 of the 7 tests fail on the unfixed wrapper.
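
As a rough illustration of the prefix gate and of the stub-based check described above, here is a minimal sketch. `RecordingModel`, `select_prompt` and `encode_with_gate` are hypothetical names, not the wrapper's or the test file's actual code; only the `task` and `prompt` kwarg names and the prefix strings come from this thread.

```python
class RecordingModel:
    """Hypothetical stand-in for the wrapped model; records encode kwargs."""

    def __init__(self):
        self.calls = []

    def encode(self, sentences, **kwargs):
        self.calls.append(kwargs)
        # Dummy one-dimensional embedding per input sentence.
        return [[0.0] for _ in sentences]


def select_prompt(jina_task: str, is_query: bool) -> str:
    """Only the retrieval adapter gets a prefix; all others get an empty prompt."""
    if jina_task != "retrieval":
        return ""
    return "Query: " if is_query else "Document: "


def encode_with_gate(model, sentences, jina_task: str, is_query: bool):
    """Hypothetical encode path: pass the adapter name and the gated prompt."""
    prompt = select_prompt(jina_task, is_query)
    return model.encode(sentences, task=jina_task, prompt=prompt)


stub = RecordingModel()
encode_with_gate(stub, ["a photo of a cat"], "retrieval", is_query=True)
encode_with_gate(stub, ["a photo of a cat"], "clustering", is_query=True)
for call in stub.calls:
    print(call)
```

Capturing the kwargs on a stub keeps such a check fast, since no model weights are loaded.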

Per Samoed's review on embeddings-benchmark#4656: use mteb's `task_metadata.simplified_task_type` in the wrapper instead of listing every concrete MTEB task type in `model_prompts`.

Wrapper now resolves the Jina LoRA adapter in two steps:

1. Per-MTEB-type override from `model_prompts` (only for harness routings that diverge from simplified_task_type).
2. Otherwise fall back to `task_metadata.simplified_task_type`, mapped to the Jina variant via `_SIMPLIFIED_TO_JINA_TASK`.

This shrinks each omni ModelMeta's `model_prompts` from 27 entries to 8 - only the genuine overrides where our harness training disagrees with simplified_task_type:

- *Classification (image/audio/video, zeroshot + multilabel) -> retrieval (the classification LoRA is empty for non-text modalities; eval runs through the retrieval adapter's contrastive embeddings)
- Compositionality -> clustering (simplified is pair-classification)

Text routings (Classification -> classification, PairClassification -> text-matching, etc.) are now handled automatically by the simplified fallback. New MTEB task types route automatically without wrapper changes.

Tests extended from 7 to 17:

- existing prefix-gating + nano dtype regressions still pass
- simplified_task_type fallback for Any2AnyRetrieval, ImageClustering, VisualSTS(eng), AudioPairClassification
- override path for ImageClassification, AudioClassification, ZeroShotClassification, Compositionality
- drift guard: SimplifiedTaskType set must equal _SIMPLIFIED_TO_JINA_TASK keys
- minimality guard: every override must differ from the simplified default

florian-hoenicke · 2026-05-13T08:38:56Z

@Samoed — good call, just pushed ee555e9d to do exactly that.

The wrapper now resolves the Jina LoRA adapter in two steps:

1. Look up the concrete MTEB type in model_prompts (per-type override).

2. If not present, fall back to task_metadata.simplified_task_type, mapped to a Jina variant via:

_SIMPLIFIED_TO_JINA_TASK = {
    "retrieval": "retrieval",
    "clustering": "clustering",
    "classification": "classification",
    "semantic-similarity": "text-matching",
    "pair-classification": "text-matching",
}

Each omni ModelMeta model_prompts shrinks from 27 entries to 8 — only the genuine overrides where our harness routing diverges from simplified_task_type:

| MTEB type | simplified_task_type | Jina LoRA (override) | reason |
|---|---|---|---|
| `ImageClassification`, `ZeroShotClassification`, `AudioClassification`, `AudioZeroshotClassification`, `AudioMultilabelClassification`, `VideoClassification`, `VideoZeroshotClassification` | `classification` | `retrieval` | the classification LoRA's task list is empty for non-text modalities; eval runs through the retrieval adapter's contrastive embeddings |
| `Compositionality` | `pair-classification` | `clustering` | harness routes it through the clustering adapter |

Text routings (Classification → classification, PairClassification → text-matching, etc.) are now handled automatically by the simplified fallback; same for Any2AnyRetrieval, ImageClustering, VisualSTS(eng/multi), AudioPairClassification, etc.

Added two guard tests:

test_simplified_to_jina_task_covers_all_simplified_types — fails on drift if mteb adds a new SimplifiedTaskType.
test_omni_overrides_are_minimal — fails if any override actually matches the simplified default (forces the dict to stay minimal).

17/17 tests passing; the 4 newly-added fallback tests confirm new MTEB task types route correctly without touching the wrapper.
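
The two guards can be sketched along these lines. This is an assumption-laden illustration rather than the actual test code: `ASSUMED_SIMPLIFIED_TYPES` is a stand-in for mteb's SimplifiedTaskType values (inferred from the statement that the mapping keys must equal that set), and the helper names and `EXAMPLE_OVERRIDES` are hypothetical.

```python
# Mapping copied from the comment above.
_SIMPLIFIED_TO_JINA_TASK = {
    "retrieval": "retrieval",
    "clustering": "clustering",
    "classification": "classification",
    "semantic-similarity": "text-matching",
    "pair-classification": "text-matching",
}

# ASSUMPTION: stand-in for the values of mteb's SimplifiedTaskType.
ASSUMED_SIMPLIFIED_TYPES = {
    "retrieval",
    "clustering",
    "classification",
    "semantic-similarity",
    "pair-classification",
}

# MTEB type -> (simplified_task_type, overridden Jina adapter), from the table above.
EXAMPLE_OVERRIDES = {
    "ImageClassification": ("classification", "retrieval"),
    "Compositionality": ("pair-classification", "clustering"),
}


def missing_simplified_types(known_types):
    """Drift guard: simplified types that have no Jina adapter mapping."""
    return set(known_types) - set(_SIMPLIFIED_TO_JINA_TASK)


def redundant_overrides(overrides):
    """Minimality guard: overrides that merely repeat the simplified default."""
    return [
        mteb_type
        for mteb_type, (simplified, jina_task) in overrides.items()
        if _SIMPLIFIED_TO_JINA_TASK[simplified] == jina_task
    ]


print(missing_simplified_types(ASSUMED_SIMPLIFIED_TYPES))
print(redundant_overrides(EXAMPLE_OVERRIDES))
```

Both helpers return empty collections when the mapping is complete and every override is a genuine divergence, so a test only has to assert emptiness.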

Samoed · 2026-05-13T08:53:45Z

-            "Reranking": "retrieval",
-            "Summarization": "text-matching",
-            "InstructionReranking": "retrieval",
+            # multimodal *Classification → retrieval (the classification LoRA


I don't think you should delete old prompts

…ext-matching

Empirically validated via a full crosswise eval (145 tasks × 4 LoRA variants × 2 models, 1153 successful evals) on the jina v5 omni models. For each task type, picked the variant with the highest mean main_score across all tasks in that type.

Two routings change vs the previous defaults:

- AudioPairClassification: text-matching -> retrieval. Affects 3 tasks (NMSQAPair, CREMADPair, VoxPopuliAccentPair). Mean gain: nano +0.080, small +0.058; both models improve.
- ImageClustering: clustering -> text-matching. Affects 5 tasks (CIFAR10/100Clustering, TinyImageNetClustering, ImageNet10/Dog15Clustering). Mean gain: nano +0.078, small +0.004; nano improves substantially, small marginally (no regression).

All other ~30 task types audited: current routing is already optimal or within noise. Adds two new wrapper tests asserting the routings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A 23-task / 2-model experimental run on internal A2 hardware showed that routing AROVisualRelation and AROVisualAttribution off the retrieval adapter costs 0.03-0.10 cosine on average (both clustering and text-matching adapter underperform retrieval on these compositionality tasks). Leave Compositionality on the retrieval default — matches what the original wrapper did and what the experiment confirms. The other non-default routings in this dict (ImageClustering, AudioClustering, VisualSTS, AudioPairClassification) were independently verified and stay as-is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s wrong

The previous revert (164db5d) was based on a partial experiment that tested only AROVisualRelation and AROVisualAttribution (2 of 7 Compositionality tasks). It also wrongly assumed removing the dict entry would fall through to the retrieval adapter; in fact it falls through to text-matching via the simplified-task-type table (pair-classification -> text-matching).

A subsequent 5-agent audit on a 153-task A2 validation run measured the actual deltas vs the clustering-routing baseline:

| task | size | clustering | text-matching | delta |
|-------------------------|-------|-----------:|--------------:|-------:|
| AROFlickrOrder | small | 0.5784 | 0.3118 | -0.27 |
| AROFlickrOrder | nano | 0.2690 | 0.1242 | -0.14 |
| SugarCrepe | small | ~base | -0.03 | -0.03 |
| Winoground | small | ~base | -0.015 | -0.015 |
| ImageCoDe | small | ~base | -0.01 | -0.01 |
| AROVisualRelation | both | ~base | ~base | <0.005 |
| AROVisualAttribution | both | ~base | ~base | <0.005 |

AROFlickrOrder is dominated by text-text discrimination (5 word-permuted captions per fixed image) so it's hyper-sensitive to which LoRA adapter is loaded. Clustering wins consistently; restore it for the whole Compositionality type. The two ARO {Relation,Attribution} tasks are insensitive, so any choice is fine for them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
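
The selection rule used earlier in this thread (pick the variant with the highest mean main_score) can be sketched on the two AROFlickrOrder rows quoted above. This is only an illustration on those four numbers, not a reproduction of the full audit; `best_variant` is a hypothetical helper.

```python
from statistics import mean

# AROFlickrOrder main scores quoted in the commit message above.
AROFLICKR_SCORES = {
    "clustering": {"small": 0.5784, "nano": 0.2690},
    "text-matching": {"small": 0.3118, "nano": 0.1242},
}


def best_variant(scores):
    """Return (winning variant, mean score per variant) for one task type."""
    means = {
        variant: mean(by_model.values())
        for variant, by_model in scores.items()
    }
    winner = max(means, key=means.get)
    return winner, means


winner, means = best_variant(AROFLICKR_SCORES)
print(winner)
for variant, value in means.items():
    print(f"{variant}: {value:.4f}")
```

On these rows the clustering adapter has the higher mean, which is consistent with the decision to restore clustering for the Compositionality type.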

@Samoed

Per @Samoed's review comment on line 997 ("I don't think you should delete old prompts"). Restoring the 10 text MTEB-type entries (Retrieval, Clustering, Classification, STS, PairClassification, BitextMining, MultilabelClassification, Reranking, Summarization, InstructionReranking) as explicit entries in model_prompts on both omni-small and omni-nano. These entries are functionally redundant with the _SIMPLIFIED_TO_JINA_TASK fallback - verified that each resolves to the same Jina LoRA adapter via either path. Restoring them costs nothing in routing behaviour and adds a safety net against upstream drift in _TASKTYPE2SIMPLIFIEDTASKTYPE. Drop test_omni_overrides_are_minimal since redundant entries are now explicitly allowed. 18/18 remaining tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

florian-hoenicke · 2026-05-17T20:54:51Z

@Samoed — restored the 10 text task-type entries in model_prompts on both omni-small and omni-nano in 488e4e1. Verified each entry resolves to the same Jina LoRA adapter via either the explicit dict or the _SIMPLIFIED_TO_JINA_TASK fallback, so this is a no-op behaviorally but acts as a safety net against upstream drift in _TASKTYPE2SIMPLIFIEDTASKTYPE. The redundant-entry guard test is removed; 18/18 remaining tests pass.

Is this what you had in mind?

The same 20-entry model_prompts dict was duplicated across jina_embeddings_v5_omni_small and jina_embeddings_v5_omni_nano. Lift it to a module-level _OMNI_MODEL_PROMPTS constant; both ModelMetas reference it. Behavior unchanged, ~30 fewer lines of diff for reviewers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Samoed · 2026-05-18T12:22:28Z

I'm not sure if we want to keep these tests

These were added as regression guards for two real bugs we hit:

the prefix-injection bug (which collapsed VisualSTS17Eng from 0.87 → 0.47 silently)

the simplified-task-type fallback added in ee555e9d per your earlier suggestion

The core 6 tests verify the prefix gate per variant; the rest cover the simplified-fallback path, the override path, and two drift guards (one that the override dict only contains genuine divergences from the simplified default, one that _SIMPLIFIED_TO_JINA_TASK covers every SimplifiedTaskType mteb defines).

I can trim to a minimal 3-test set (retrieval keeps prefix / non-retrieval drops prefix / drift guard) if you prefer, or drop the whole file. Let me know which way you'd like to go.

I simplified your tests. Can you look into them? Also @KennethEnevoldsen What do you think?

The simplification looks great — much cleaner with MockSentenceTransformer / MockRetrievalTask and the parametrized form. All 17 cases pass locally. Happy with this; the per-call logger.info is a nice touch too.

Per Samoed's review on PR embeddings-benchmark#4656: collapsed the 18-line _OMNI_MODEL_PROMPTS rationale to 2 lines, the 7-line resolution-steps comment in encode() to 1 line, and the 5-line prefix-gating comment to 1 line. Behaviour unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

KennethEnevoldsen

Looks good - SimplifiedType was mostly intended as a documentation variable, but the use here doesn't seem problematic to me.

KennethEnevoldsen · 2026-05-20T09:14:04Z

Hmm, generally we haven't yet maintained tests for model-specific implementations. These are luckily fast.

I would be fine with adding these as a way to experiment more with covering central models with (fast) tests. However I could also see the argument for the reverse. Leaning accept and merge.

Samoed · 2026-05-20T09:23:34Z

SimplifiedType was mostly intended as a documentation variable

FYI, it is also used in Bidir models.

* add: jina-embeddings-v5-omni MIEB+MAEB results

Adds task results for MIEB(Multilingual) and MAEB(beta) for both jinaai/jina-embeddings-v5-omni-nano and jinaai/jina-embeddings-v5-omni-small.

- nano: 159/160 (FleursT2ARetrieval pending; eval-env blocker)
- small: 158/160 (FleursT2ARetrieval + UCF101ZeroShot pending)

Variant routing uses the patched JinaV5OmniWrapper from embeddings-benchmark/mteb#4656 (model_prompts mapping for ImageClustering / VisualSTS / AudioClustering / AudioPairClassification to the correct LoRA adapter).

Pinned revisions:

- nano: 2b230c93c996e091a45b95af4e3315dd07605ee3
- small: dfdcc361ec47c69a5afcd81e4bd148abb9d0568e

* add: omni nano/small UCF101ZeroShot + FleursT2ARetrieval results

Completes the MIEB+MAEB result set for both jina-embeddings-v5-omni models:

- small UCF101ZeroShot (nano was already present)
- FleursT2ARetrieval for nano and small (all 102 language subsets)

Both tasks were evaluated with the official mteb package on 8xH100, sharded for parallelism (UCF101ZeroShot by sample stride, Fleurs by language subset) then merged into canonical TaskResult JSONs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* rerun: regenerate MIEB+MAEB JSONs with bf16 + gated-prefix wrapper

Both omni nano and small re-evaluated end-to-end on A2 with:

- bf16 (matches production); nano was previously generated in float32
- fixed JinaV5OmniWrapper from PR #4656: the prefix injection is now gated on task=="retrieval", so clustering / text-matching / classification adapters receive raw text (matching how they were trained)

Largest score recoveries are on the non-retrieval-variant tasks that the old wrapper depressed by feeding them "Query: "/"Document: " prefixes:

| task | nano | small |
|---|---|---|
| VisualSTS17Eng | 0.471 -> 0.822 | 0.471 -> 0.871 |
| AROCocoOrder | 0.129 -> 0.276 | 0.371 -> 0.514 |
| AROFlickrOrder | 0.070 -> 0.398 | 0.527 -> 0.586 |
| VisualSTS-b-Eng | 0.619 -> 0.823 | 0.732 -> 0.879 |
| CIFAR10Clustering | 0.735 -> 0.812 | 0.844 -> 0.873 |

Retrieval-variant tasks (which always got the prefix correctly) are essentially unchanged. FleursT2ARetrieval was sharded by language across 8 GPUs (102/102 subsets present); UCF101 / UCF101ZeroShot ran full corpus. 160 nano tasks + 27 non-retrieval-variant small tasks regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* update: re-route AudioPairClassification and ImageClustering to better LoRAs

A full crosswise eval (145 tasks × 4 LoRA variants × 2 models, 1153 evals on A2 H100s) found two task types where the omni wrapper was routing to a suboptimal adapter:

- AudioPairClassification: was text-matching -> now retrieval. 3 tasks. NMSQAPair score jumps nano 0.735 -> 0.939 and small 0.735 -> 0.934 (the retrieval adapter handles the speech-vs-text pair structure cleanly, while text-matching was depressing it).
- ImageClustering: was clustering -> now text-matching. 5 tasks. CIFAR10/100, TinyImageNet, ImageNet10/Dog15 each +0.004 to +0.08.

Routing change is in mteb PR #4656 (commit 841cb961). All other ~30 audited task types kept their current routing; they were already optimal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Samoed reviewed May 12, 2026

florian-hoenicke mentioned this pull request May 13, 2026

add: jina-embeddings-v5-omni MIEB+MAEB results embeddings-benchmark/results#538

Merged

3 tasks

Samoed reviewed May 13, 2026

florian-hoenicke and others added 6 commits May 14, 2026 16:57

style: ruff format test_jina_v5_omni_wrapper.py

b214d2e

Fixes the lint-check CI failure on PR embeddings-benchmark#4656. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

style: fix ruff check lint errors in test_jina_v5_omni_wrapper.py

9a3efa5

Remove unused noqa directive, simplify empty-string comparisons, sort imports. Fixes the lint-check CI failure on PR embeddings-benchmark#4656. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Samoed reviewed May 18, 2026

Samoed requested a review from KennethEnevoldsen May 18, 2026 12:22

florian-hoenicke and others added 3 commits May 18, 2026 16:18

simplify tests

5ec596d

add logging msg

196e213

KennethEnevoldsen approved these changes May 20, 2026

Samoed approved these changes May 20, 2026

Samoed merged commit 74f8473 into embeddings-benchmark:main May 20, 2026
13 checks passed
