Skip to content

fix: route MIEB/MAEB task-types to correct LoRA adapter for jina v5 omni#4656

Merged
Samoed merged 13 commits into
embeddings-benchmark:mainfrom
florian-hoenicke:fix-jina-omni-model-prompts
May 20, 2026
Merged

fix: route MIEB/MAEB task-types to correct LoRA adapter for jina v5 omni#4656
Samoed merged 13 commits into
embeddings-benchmark:mainfrom
florian-hoenicke:fix-jina-omni-model-prompts

Conversation

@florian-hoenicke
Copy link
Copy Markdown
Contributor

@florian-hoenicke florian-hoenicke commented May 11, 2026

Summary

jinaai/jina-embeddings-v5-omni-{nano,small} ship four LoRA adapters (retrieval / clustering / text-matching / classification) and switch between them via the task= arg passed to self.model.encode(...) inside JinaV5OmniWrapper.encode. The initial omni registration in #4604 copied the text wrapper's model_prompts dict, which only had text-task-type keys (Retrieval, Clustering, STS, …). Every MIEB / MAEB task therefore fell through to a hard-coded task="retrieval", encoding clustering, VisualSTS, audio-clustering and audio-pair tasks with the wrong adapter.

Fix

Three changes in this PR:

  1. Adapter routing via simplified-task fallback + per-type overrides. JinaV5OmniWrapper.encode now resolves the adapter in two steps: (a) per-MTEB-type override from loader_kwargs.model_prompts, (b) otherwise fall back to task_metadata.simplified_task_type mapped via:

    _SIMPLIFIED_TO_JINA_TASK = {
        "retrieval": "retrieval",
        "clustering": "clustering",
        "classification": "classification",
        "semantic-similarity": "text-matching",
        "pair-classification": "text-matching",
    }

    Each omni ModelMeta keeps only the 10 overrides where empirical routing diverges from the simplified default. New MTEB task types route automatically.

  2. Drop retrieval-only "Query: " / "Document: " prefix for non-retrieval adapters. Only the retrieval LoRA was trained with the prefix; injecting it on the other variants collapsed scores (VisualSTS(eng) 0.47→0.83, AROCocoOrder 0.37→0.52, NMSQAPair 0.75→0.83 on small).

  3. Drop the nano torch_dtype=float32 override. Forced fp32 on nano (small ran bf16) broke OCR/document tasks (HatefulMemesT2I 0.06→0.77, VidoreArxivQA 0.19→0.75).

Final per-type routing (after empirical validation: 145 tasks × 4 variants × 2 models)

LoRA adapter MTEB task types routed here Source
retrieval Any2AnyRetrieval, Any2AnyMultilingualRetrieval, AudioRetrieval, DocumentUnderstanding, VisionCentricQA, AudioReranking simplified-fallback (retrieval)
retrieval ImageClassification, ZeroShotClassification, AudioClassification, AudioZeroshotClassification, AudioMultilabelClassification, VideoClassification, VideoZeroshotClassification override (classification LoRA empty for non-text modalities)
retrieval AudioPairClassification override (+0.06-0.08 vs text-matching on NMSQAPair / CREMADPair / VoxPopuliAccentPair)
clustering AudioClustering simplified-fallback (clustering)
clustering Compositionality override (+0.01 to +0.27 vs text-matching on AROFlickrOrder / AROCocoOrder / SugarCrepe / Winoground / ImageCoDe; AROVisualRelation / AROVisualAttribution insensitive)
text-matching VisualSTS(eng), VisualSTS(multi) simplified-fallback (semantic-similarity)
text-matching ImageClustering override (best on 8 of 10 small+nano combinations across CIFAR10/100, TinyImageNet, ImageNet10/Dog15 Clustering; aggregate mean 0.771 vs retrieval 0.763 vs clustering 0.756)
classification (none — every *Classification MIEB/MAEB type routes to retrieval)

Tests

tests/test_models/test_jina_v5_omni_wrapper.py (19 tests):

  • prefix gating: retrieval keeps "Query: " / "Document: ", other variants get prompt=""
  • nano loader_kwargs no longer pins torch_dtype=float32
  • simplified-fallback paths (Any2AnyRetrieval, AudioClustering, VisualSTS(eng), AudioPairClassification)
  • override paths (Image/Audio/Zeroshot/VideoClassification, Compositionality, AudioPairClassification, ImageClustering)
  • drift guard: _SIMPLIFIED_TO_JINA_TASK keys must equal mteb's SimplifiedTaskType set
  • minimality guard: every override must differ from the simplified default

🤖 Generated with Claude Code

The omni models ship four LoRA adapters (retrieval / clustering /
text-matching / classification) and switch via the `task=` arg passed
to `self.model.encode(...)` in JinaV5OmniWrapper. The arg is resolved by
`get_prompt_name(model_prompts, task_metadata, prompt_type)` which only
looks up by `task_name`, `task_type`, or `prompt_type` in
`model_prompts`. Our previous dict only had text-task-type keys
(Retrieval, Clustering, …), so every MIEB image and MAEB audio task
fell through to `jina_task_name = None` and the wrapper hardcoded
`task = "retrieval"`. As a result every image/audio task was encoded
with the retrieval adapter — clustering and VisualSTS scores were
not produced with the variant the model is trained for.

This commit adds the MIEB and MAEB task-type keys so the resolver maps
each to the right LoRA adapter:

  retrieval     <- Any2AnyRetrieval, Any2AnyMultilingualRetrieval,
                   DocumentUnderstanding, ZeroShotClassification,
                   ImageClassification, VisionCentricQA,
                   AudioRetrieval, AudioClassification,
                   AudioZeroshotClassification,
                   AudioMultilabelClassification, AudioReranking
  clustering    <- ImageClustering, Compositionality, AudioClustering
  text-matching <- VisualSTS(eng), VisualSTS(multi),
                   AudioPairClassification

The assignment matches Jina's internal frontier-dashboard variant
routing (harness `VARIANT_TASK_TO_MTEB_TYPES` /
`VARIANT_TASK_TO_DATASETS["maeb"]`).
@florian-hoenicke
Copy link
Copy Markdown
Contributor Author

Task-type → variant assignment rationale

For traceability, here is the complete mapping we use, which mirrors Jina's internal frontier-dashboard routing (harness/src/core/evaluator.py: VARIANT_TASK_TO_MTEB_TYPES and VARIANT_TASK_TO_DATASETS["maeb"]):

Image (MIEB lite / eng / Multilingual)

MTEB task type LoRA adapter Example tasks
Any2AnyRetrieval retrieval MSCOCOT2IRetrieval, Flickr30kT2I, BLINK*, OVEN*, METI2IRetrieval, GLDv2*, Imagenet1kZeroShot, …
Any2AnyMultilingualRetrieval retrieval XM3600T2IRetrieval, XFlickr30kCoT2IRetrieval, WITT2IRetrieval
DocumentUnderstanding retrieval Vidore* (DocVQA, InfoVQA, TabFQuAD, …)
ZeroShotClassification retrieval Caltech101ZeroShot, MNISTZeroShot, RESISC45ZeroShot, …
ImageClassification retrieval Caltech101, DTD, Food101Classification, Imagenet1k, …
VisionCentricQA retrieval CVBenchCount, CVBenchDepth, CVBenchDistance, CVBenchRelation
ImageClustering clustering CIFAR10Clustering, CIFAR100Clustering, TinyImageNetClustering, ImageNet10Clustering, ImageNetDog15Clustering
Compositionality clustering AROCocoOrder, AROFlickrOrder, AROVisualAttribution, AROVisualRelation, SugarCrepe, Winoground, ImageCoDe
VisualSTS(eng) text-matching STS12-16VisualSTS, VisualSTS17Eng, VisualSTS-b-Eng
VisualSTS(multi) text-matching VisualSTS17Multilingual, VisualSTS-b-Multilingual

Audio (MAEB beta)

MTEB task type LoRA adapter Example tasks
AudioRetrieval retrieval ClothoT2A, FleursT2A, GigaSpeechT2A, SpokenSQuADT2A, JamAlt*, MACST2A, UrbanSound8KT2A, …
AudioClassification retrieval BeijingOpera, BirdCLEF, CREMA_D, FSD2019Kaggle, GTZANGenre, IEMOCAPGender, MridinghamTonic, VoxCelebSA
AudioZeroshotClassification retrieval RavdessZeroshot, SpeechCommandsZeroshotv0.02, CommonLanguageAgeDetection, VoxPopuliLanguageID
AudioReranking retrieval GTZANAudioReranking
AudioMultilabelClassification retrieval SIBFLEURS
AudioClustering clustering CREMA_DClustering, VehicleSoundClustering, VoxPopuliGenderClustering
AudioPairClassification text-matching CREMADPairClassification, NMSQAPairClassification, VoxPopuliAccentPairClassification

Why no classification adapter

Every task whose MTEB type contains "Classification" (image or audio, zeroshot or supervised) is routed to the retrieval adapter. That matches our internal training/eval pipeline where the classification LoRA's task list is empty and classification eval is performed via the retrieval adapter's contrastive embeddings.

Verification

After this patch, the wrapper logs Using prompt_name=ImageClustering for task=CIFAR10Clustering prompt_type=... (or the equivalent for VisualSTS / AudioClustering / AudioPairClassification) instead of No combination of task name and prompt type was found. We're now re-running affected tasks (12 image clustering + 9 image VisualSTS + 3 audio clustering + 3 audio pair-classification × 2 models = 54 tasks) and will update embeddings-benchmark/results with the corrected JSONs once they land.

🤖 Generated with Claude Code

Copy link
Copy Markdown
Member

@Samoed Samoed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think would be better to change in wrapper task.metadata.simplified_task_type

…na v5 omni

Two divergences vs the variant-trained models, both observed when comparing
our MTEB output against the model author's harness frontier scores:

1. JinaV5OmniWrapper unconditionally prepended "Query: " / "Document: " to
   every encode call. Only the retrieval LoRA adapter was trained with that
   prefix; clustering / text-matching / classification variants were trained
   without it. Injecting it collapsed scores on VisualSTS(eng), AROCocoOrder,
   NMSQAPairClassification, etc. The wrapper now only adds the prefix when
   `task == "retrieval"`, matching the upstream training-time convention
   (instructions={"query":"","document":""} for non-retrieval variants).

2. The nano ModelMeta hard-coded `torch_dtype=torch.float32`, while small ran
   in bf16. The forced upcast on nano caused large divergence on OCR /
   document retrieval tasks (HatefulMemesT2IRetrieval 0.06 vs 0.77, Vidore
   arxiv 0.19 vs 0.75). Dropping the override makes nano load with the same
   dtype handling as small.

Adds tests/test_models/test_jina_v5_omni_wrapper.py with bidirectional
verification (5 tests fail on the unfixed wrapper, all 7 pass with the fix).
@florian-hoenicke
Copy link
Copy Markdown
Contributor Author

Added two follow-up fixes after comparing our MTEB outputs against the model author's evaluation harness:

1. Retrieval-only prompt prefix. JinaV5OmniWrapper.encode was unconditionally prepending "Query: " / "Document: " for every task type. Only the retrieval LoRA adapter was trained with that prefix; clustering, text-matching, and classification variants were trained without it. The wrapper now only adds the prefix when task == "retrieval". This matches the upstream training-time convention (instructions={"query":"","document":""} for non-retrieval variants). Observed effect on small without the gating: VisualSTS(eng) 0.47 vs 0.83, AROCocoOrder 0.37 vs 0.52, NMSQAPairClassification 0.75 vs 0.83.

2. nano torch_dtype=float32 override. The nano ModelMeta was forcing fp32, while small ran in bf16. The forced upcast caused large divergence on OCR / document retrieval tasks: HatefulMemesT2IRetrieval 0.06 vs 0.77, VidoreArxivQARetrieval 0.19 vs 0.75. Dropping the override makes nano load with the same dtype handling as small.

tests/test_models/test_jina_v5_omni_wrapper.py covers both: stubs the underlying SentenceTransformer to capture the task and prompt kwargs the wrapper passes, and asserts (a) retrieval keeps "Query: "/"Document: ", (b) clustering / text-matching / classification get prompt="", (c) nano loader_kwargs no longer pins float32. Verified bidirectional: 5 of the 7 tests fail on the unfixed wrapper.

Per Samoed's review on embeddings-benchmark#4656: use mteb's `task_metadata.simplified_task_type`
in the wrapper instead of listing every concrete MTEB task type in
`model_prompts`.

Wrapper now resolves the Jina LoRA adapter in two steps:
1. Per-MTEB-type override from `model_prompts` (only for harness routings
   that diverge from simplified_task_type).
2. Otherwise fall back to `task_metadata.simplified_task_type`, mapped to
   the Jina variant via `_SIMPLIFIED_TO_JINA_TASK`.

This shrinks each omni ModelMeta's `model_prompts` from 27 entries to 8 -
only the genuine overrides where our harness training disagrees with
simplified_task_type:

  - *Classification (image/audio/video, zeroshot + multilabel) -> retrieval
    (the classification LoRA is empty for non-text modalities; eval runs
    through the retrieval adapter's contrastive embeddings)
  - Compositionality -> clustering (simplified is pair-classification)

Text routings (Classification -> classification, PairClassification ->
text-matching, etc.) are now handled automatically by the simplified
fallback. New MTEB task types route automatically without wrapper changes.

Tests extended from 7 to 17:
  - existing prefix-gating + nano dtype regressions still pass
  - simplified_task_type fallback for Any2AnyRetrieval, ImageClustering,
    VisualSTS(eng), AudioPairClassification
  - override path for ImageClassification, AudioClassification,
    ZeroShotClassification, Compositionality
  - drift guard: SimplifiedTaskType set must equal _SIMPLIFIED_TO_JINA_TASK keys
  - minimality guard: every override must differ from the simplified default
@florian-hoenicke
Copy link
Copy Markdown
Contributor Author

@Samoed — good call, just pushed ee555e9d to do exactly that.

The wrapper now resolves the Jina LoRA adapter in two steps:

  1. Look up the concrete MTEB type in model_prompts (per-type override).
  2. If not present, fall back to task_metadata.simplified_task_type, mapped to a Jina variant via:
    _SIMPLIFIED_TO_JINA_TASK = {
        "retrieval": "retrieval",
        "clustering": "clustering",
        "classification": "classification",
        "semantic-similarity": "text-matching",
        "pair-classification": "text-matching",
    }

Each omni ModelMeta model_prompts shrinks from 27 entries to 8 — only the genuine overrides where our harness routing diverges from simplified_task_type:

MTEB type simplified_task_type Jina LoRA (override) reason
ImageClassification, ZeroShotClassification, AudioClassification, AudioZeroshotClassification, AudioMultilabelClassification, VideoClassification, VideoZeroshotClassification classification retrieval the classification LoRA's task list is empty for non-text modalities — eval runs through the retrieval adapter's contrastive embeddings
Compositionality pair-classification clustering harness routes it through the clustering adapter

Text routings (Classificationclassification, PairClassificationtext-matching, etc.) are now handled automatically by the simplified fallback; same for Any2AnyRetrieval, ImageClustering, VisualSTS(eng/multi), AudioPairClassification, etc.

Added two guard tests:

  • test_simplified_to_jina_task_covers_all_simplified_types — fails on drift if mteb adds a new SimplifiedTaskType.
  • test_omni_overrides_are_minimal — fails if any override actually matches the simplified default (forces the dict to stay minimal).

17/17 tests passing; the 4 newly-added fallback tests confirm new MTEB task types route correctly without touching the wrapper.

"Reranking": "retrieval",
"Summarization": "text-matching",
"InstructionReranking": "retrieval",
# multimodal *Classification → retrieval (the classification LoRA
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should delete old prompts

florian-hoenicke and others added 6 commits May 14, 2026 16:57
Fixes the lint-check CI failure on PR embeddings-benchmark#4656.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove unused noqa directive, simplify empty-string comparisons,
sort imports. Fixes the lint-check CI failure on PR embeddings-benchmark#4656.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ext-matching

Empirically validated via a full crosswise eval (145 tasks × 4 LoRA variants
× 2 models, 1153 successful evals) on the jina v5 omni models. For each task
type, picked the variant with the highest mean main_score across all tasks
in that type.

Two routings change vs the previous defaults:

  AudioPairClassification: text-matching -> retrieval
    Affects 3 tasks (NMSQAPair, CREMADPair, VoxPopuliAccentPair).
    Mean gain: nano +0.080, small +0.058 — both models improve.

  ImageClustering: clustering -> text-matching
    Affects 5 tasks (CIFAR10/100Clustering, TinyImageNetClustering,
    ImageNet10/Dog15Clustering).
    Mean gain: nano +0.078, small +0.004 — nano improves substantially,
    small marginally (no regression).

All other ~30 task types audited: current routing is already optimal or
within noise.

Adds two new wrapper tests asserting the routings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A 23-task / 2-model experimental run on internal A2 hardware showed that
routing AROVisualRelation and AROVisualAttribution off the retrieval adapter
costs 0.03-0.10 cosine on average (both clustering and text-matching adapter
underperform retrieval on these compositionality tasks). Leave Compositionality
on the retrieval default — matches what the original wrapper did and what
the experiment confirms.

The other non-default routings in this dict (ImageClustering, AudioClustering,
VisualSTS, AudioPairClassification) were independently verified and stay as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s wrong

The previous revert (164db5d) was based on a partial experiment that tested
only AROVisualRelation and AROVisualAttribution (2 of 7 Compositionality
tasks). It also wrongly assumed removing the dict entry would fall through to
the retrieval adapter — in fact it falls through to text-matching via the
simplified-task-type table (pair-classification -> text-matching).

A subsequent 5-agent audit on a 153-task A2 validation run measured the
actual deltas vs the clustering-routing baseline:

  | task                    | size  | clustering | text-matching | delta  |
  |-------------------------|-------|-----------:|--------------:|-------:|
  | AROFlickrOrder          | small |     0.5784 |        0.3118 | -0.27  |
  | AROFlickrOrder          | nano  |     0.2690 |        0.1242 | -0.14  |
  | SugarCrepe              | small |     ~base  |        -0.03  | -0.03  |
  | Winoground              | small |     ~base  |        -0.015 | -0.015 |
  | ImageCoDe               | small |     ~base  |        -0.01  | -0.01  |
  | AROVisualRelation       | both  |     ~base  |        ~base  | <0.005 |
  | AROVisualAttribution    | both  |     ~base  |        ~base  | <0.005 |

AROFlickrOrder is dominated by text-text discrimination (5 word-permuted
captions per fixed image) so it's hyper-sensitive to which LoRA adapter is
loaded. Clustering wins consistently; restore it for the whole Compositionality
type. The two ARO {Relation,Attribution} tasks are insensitive, so any choice
is fine for them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per @Samoed's review comment on line 997 ("I don't think you should delete
old prompts"). Restoring the 10 text MTEB-type entries (Retrieval, Clustering,
Classification, STS, PairClassification, BitextMining, MultilabelClassification,
Reranking, Summarization, InstructionReranking) as explicit entries in
model_prompts on both omni-small and omni-nano.

These entries are functionally redundant with the _SIMPLIFIED_TO_JINA_TASK
fallback - verified that each resolves to the same Jina LoRA adapter via
either path. Restoring them costs nothing in routing behaviour and adds a
safety net against upstream drift in _TASKTYPE2SIMPLIFIEDTASKTYPE.

Drop test_omni_overrides_are_minimal since redundant entries are now
explicitly allowed. 18/18 remaining tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@florian-hoenicke
Copy link
Copy Markdown
Contributor Author

@Samoed — restored the 10 text task-type entries in model_prompts on both omni-small and omni-nano in 488e4e1. Verified each entry resolves to the same Jina LoRA adapter via either the explicit dict or the _SIMPLIFIED_TO_JINA_TASK fallback, so this is a no-op behaviorally but acts as a safety net against upstream drift in _TASKTYPE2SIMPLIFIEDTASKTYPE. The redundant-entry guard test is removed; 18/18 remaining tests pass.

Is this what you had in mind?

The same 20-entry model_prompts dict was duplicated across
jina_embeddings_v5_omni_small and jina_embeddings_v5_omni_nano. Lift it to
a module-level _OMNI_MODEL_PROMPTS constant; both ModelMetas reference it.
Behavior unchanged, ~30 fewer lines of diff for reviewers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread mteb/models/model_implementations/jina_models.py
Comment thread mteb/models/model_implementations/jina_models.py Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to keep these tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were added as regression guards for two real bugs we hit:

  • the prefix-injection bug (which collapsed VisualSTS17Eng from 0.87 → 0.47 silently)
  • the simplified-task-type fallback added in ee555e9d per your earlier suggestion

The core 6 tests verify the prefix gate per variant; the rest cover the simplified-fallback path, the override path, and two drift guards (one that the override dict only contains genuine divergences from the simplified default, one that _SIMPLIFIED_TO_JINA_TASK covers every SimplifiedTaskType mteb defines).

I can trim to a minimal 3-test set (retrieval keeps prefix / non-retrieval drops prefix / drift guard) if you prefer, or drop the whole file. Let me know which way you'd like to go.

Copy link
Copy Markdown
Member

@Samoed Samoed May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified your tests. Can you look into them? Also @KennethEnevoldsen What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simplification looks great — much cleaner with MockSentenceTransformer / MockRetrievalTask and the parametrized form. All 17 cases pass locally. Happy with this; the per-call logger.info is a nice touch too.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm generally we haven't yet maintained tests for models specific implementations. These are luckily fast.

I would be fine with adding these as a way to experiment more with covering central models with (fast) tests. However I could also see the argument for the reverse. Leaning accept and merge.

@Samoed Samoed requested a review from KennethEnevoldsen May 18, 2026 12:22
florian-hoenicke and others added 3 commits May 18, 2026 16:18
Per Samoed's review on PR embeddings-benchmark#4656: collapsed the 18-line _OMNI_MODEL_PROMPTS
rationale to 2 lines, the 7-line resolution-steps comment in encode() to
1 line, and the 5-line prefix-gating comment to 1 line. Behaviour unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - SimplifiedType was mostly intended as a documentation variable, but the use here doesn't seem problematic to me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm generally we haven't yet maintained tests for models specific implementations. These are luckily fast.

I would be fine with adding these as a way to experiment more with covering central models with (fast) tests. However I could also see the argument for the reverse. Leaning accept and merge.

@Samoed
Copy link
Copy Markdown
Member

Samoed commented May 20, 2026

SimplifiedType was mostly intended as a documentation variable

FYI it also used in Bidir models

@Samoed Samoed merged commit 74f8473 into embeddings-benchmark:main May 20, 2026
13 checks passed
Samoed pushed a commit to embeddings-benchmark/results that referenced this pull request May 20, 2026
* add: jina-embeddings-v5-omni MIEB+MAEB results

Adds task results for MIEB(Multilingual) and MAEB(beta) for both
jinaai/jina-embeddings-v5-omni-nano and jinaai/jina-embeddings-v5-omni-small.

- nano: 159/160 (FleursT2ARetrieval pending — eval-env blocker)
- small: 158/160 (FleursT2ARetrieval + UCF101ZeroShot pending)

Variant routing uses the patched JinaV5OmniWrapper from
embeddings-benchmark/mteb#4656 (model_prompts mapping for
ImageClustering / VisualSTS / AudioClustering / AudioPairClassification
to the correct LoRA adapter).

Pinned revisions:
- nano: 2b230c93c996e091a45b95af4e3315dd07605ee3
- small: dfdcc361ec47c69a5afcd81e4bd148abb9d0568e

* add: omni nano/small UCF101ZeroShot + FleursT2ARetrieval results

Completes the MIEB+MAEB result set for both jina-embeddings-v5-omni models:
- small UCF101ZeroShot (nano was already present)
- FleursT2ARetrieval for nano and small (all 102 language subsets)

Both tasks were evaluated with the official mteb package on 8xH100,
sharded for parallelism (UCF101ZeroShot by sample stride, Fleurs by
language subset) then merged into canonical TaskResult JSONs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* rerun: regenerate MIEB+MAEB JSONs with bf16 + gated-prefix wrapper

Both omni nano and small re-evaluated end-to-end on A2 with:
- bf16 (matches production); nano was previously generated in float32
- fixed JinaV5OmniWrapper from PR #4656 — the prefix injection is now
  gated on task=="retrieval", so clustering / text-matching / classification
  adapters receive raw text (matching how they were trained)

Largest score recoveries are on the non-retrieval-variant tasks that the
old wrapper depressed by feeding them "Query: "/"Document: " prefixes:

  VisualSTS17Eng     nano 0.471 -> 0.822     small 0.471 -> 0.871
  AROCocoOrder       nano 0.129 -> 0.276     small 0.371 -> 0.514
  AROFlickrOrder     nano 0.070 -> 0.398     small 0.527 -> 0.586
  VisualSTS-b-Eng    nano 0.619 -> 0.823     small 0.732 -> 0.879
  CIFAR10Clustering  nano 0.735 -> 0.812     small 0.844 -> 0.873

Retrieval-variant tasks (which always got the prefix correctly) are
essentially unchanged. FleursT2ARetrieval was sharded by language across
8 GPUs (102/102 subsets present); UCF101 / UCF101ZeroShot ran full corpus.
160 nano tasks + 27 non-retrieval-variant small tasks regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* update: re-route AudioPairClassification and ImageClustering to better LoRAs

A full crosswise eval (145 tasks × 4 LoRA variants × 2 models, 1153 evals on
A2 H100s) found two task types where the omni wrapper was routing to a
suboptimal adapter:

  AudioPairClassification: was text-matching -> now retrieval
    3 tasks. NMSQAPair score jumps nano 0.735 -> 0.939 and small 0.735 -> 0.934
    (the retrieval adapter handles the speech-vs-text pair structure cleanly,
    while text-matching was depressing it).

  ImageClustering: was clustering -> now text-matching
    5 tasks. CIFAR10/100, TinyImageNet, ImageNet10/Dog15 each +0.004 to +0.08.

Routing change is in mteb PR #4656 (commit 841cb961). All other ~30 audited
task types kept their current routing — they were already optimal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants