
Add F2LLM-v2-80M results #503

Open

KennethEnevoldsen wants to merge 4 commits into main from Add-F2LLM-v2-80M

Conversation

@KennethEnevoldsen (Contributor)

related to: embeddings-benchmark/mteb#4381

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation (see the sketch below)
  • My model is available, either as a publicly accessible API or publicly on e.g. Huggingface
  • I solemnly swear that for all results submitted, I have not trained on the evaluation datasets, including their training splits. If I have, I have disclosed it clearly.
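
For context, results like the ones submitted here are typically produced by loading the registered reference implementation through mteb's Python API. A minimal sketch, assuming the model is registered in mteb/models/model_implementations/; the task list is an illustrative subset of the tasks evaluated in this PR:

```python
# Minimal sketch of generating results with the mteb reference
# implementation (assumption: the model is registered under this name).
import mteb

# Resolve the registered implementation by model name.
model = mteb.get_model("codefuse-ai/F2LLM-v2-80M")

# A few of the retrieval tasks from the comparison below.
tasks = mteb.get_tasks(tasks=["AILAStatutes", "LegalQuAD", "AppsRetrieval"])

# Run the evaluation; the result JSON files written to `results/`
# are what gets submitted to this repository.
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```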

github-actions bot commented Apr 30, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: codefuse-ai/F2LLM-v2-80M
Tasks: AILACasedocs, AILAStatutes, AppsRetrieval, CUREv1, ChatDoctorRetrieval, Code1Retrieval, DS1000Retrieval, EnglishFinance1Retrieval, EnglishFinance2Retrieval, EnglishFinance3Retrieval, EnglishFinance4Retrieval, EnglishHealthcare1Retrieval, FinQARetrieval, FinanceBenchRetrieval, French1Retrieval, FrenchLegal1Retrieval, FreshStackRetrieval, German1Retrieval, GermanHealthcare1Retrieval, GermanLegal1Retrieval, HC3FinanceRetrieval, HumanEvalRetrieval, JapaneseCode1Retrieval, JapaneseLegal1Retrieval, LegalQuAD, LegalSummarization, MBPPRetrieval, WikiSQLRetrieval

Results for codefuse-ai/F2LLM-v2-80M

| task_name | codefuse-ai/F2LLM-v2-80M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| AILACasedocs | nan | 0.4833 | 0.2643 | 0.6560 | Octen/Octen-Embedding-8B-INT8 | False |
| AILAStatutes | 0.1877 | 0.4877 | 0.2084 | 0.9451 | Octen/Octen-Embedding-8B-INT8 | False |
| AppsRetrieval | 0.522 | 0.9375 | 0.3255 | 0.9862 | google/gemini-embedding-2-preview | False |
| CUREv1 | 0.3349 | 0.5957 | 0.5162 | 0.6782 | voyageai/voyage-4-large (embed_dim=2048) | False |
| ChatDoctorRetrieval | nan | 0.7352 | 0.5687 | 0.7722 | voyageai/voyage-4-large (embed_dim=2048) | False |
| Code1Retrieval | nan | 0.9474 | nan | 0.9474 | google/gemini-embedding-001 | False |
| DS1000Retrieval | nan | 0.6870 | nan | 0.7149 | google/gemini-embedding-2-preview | False |
| EnglishFinance1Retrieval | nan | 0.7332 | nan | 0.8428 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishFinance2Retrieval | nan | 0.6740 | nan | 0.9137 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishFinance3Retrieval | nan | 0.8330 | nan | 0.8509 | nvidia/NV-Embed-v2 | False |
| EnglishFinance4Retrieval | nan | 0.5757 | nan | 0.6241 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishHealthcare1Retrieval | nan | 0.6338 | nan | 0.6875 | mteb/baseline-bm25s | False |
| FinQARetrieval | nan | 0.6464 | nan | 0.8897 | voyageai/voyage-4-large (embed_dim=2048) | False |
| FinanceBenchRetrieval | nan | 0.9157 | nan | 0.9459 | Octen/Octen-Embedding-8B | False |
| French1Retrieval | nan | 0.8781 | nan | 0.8884 | Cohere/Cohere-embed-v4.0 | False |
| FrenchLegal1Retrieval | nan | 0.8696 | nan | 0.9490 | mteb/baseline-bm25s | False |
| FreshStackRetrieval | nan | 0.3979 | 0.2519 | 0.5776 | Octen/Octen-Embedding-8B | False |
| German1Retrieval | nan | 0.9761 | nan | 0.9797 | voyageai/voyage-4-large (embed_dim=2048) | False |
| GermanHealthcare1Retrieval | nan | 0.8742 | nan | 0.9140 | voyageai/voyage-4-large | False |
| GermanLegal1Retrieval | nan | 0.7149 | nan | 0.7582 | voyageai/voyage-4-large (embed_dim=2048) | False |
| HC3FinanceRetrieval | nan | 0.7758 | nan | 0.8242 | nvidia/NV-Embed-v2 | False |
| HumanEvalRetrieval | nan | 0.9910 | nan | 1.0000 | google/gemini-embedding-2-preview | False |
| JapaneseCode1Retrieval | nan | 0.8650 | nan | 0.8650 | google/gemini-embedding-001 | False |
| JapaneseLegal1Retrieval | nan | 0.9228 | nan | 0.9228 | google/gemini-embedding-001 | False |
| LegalQuAD | 0.2417 | 0.6553 | 0.4317 | 0.7675 | mteb/baseline-bm25s | False |
| LegalSummarization | nan | 0.7122 | 0.621 | 0.7921 | voyageai/voyage-3.5 | False |
| MBPPRetrieval | nan | 0.9416 | nan | 0.9608 | voyageai/voyage-4-large (embed_dim=2048) | False |
| WikiSQLRetrieval | nan | 0.8814 | nan | 0.9892 | Octen/Octen-Embedding-8B | False |
| Average | 0.3216 | 0.7622 | 0.3985 | 0.8444 | nan | - |

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, 
SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA



Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

@Geralt-Targaryen (Contributor)

Hi, thanks for the addition. Were these results evaluated using the default task-type prompts? That may be sub-optimal for many tasks. I'm also evaluating the models on more tasks and plan to submit a PR to both the model implementation and the results in a few days.
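
For readers unfamiliar with the prompt concern above: when a model does not define task-specific prompts, evaluation falls back to a generic prompt shared by a whole task type (e.g., one query instruction for every retrieval task). A hedged sketch of the difference, using the plain sentence-transformers API; the prompt strings are illustrative, not the ones F2LLM-v2 was trained with, and whether this checkpoint loads via SentenceTransformer is an assumption:

```python
from sentence_transformers import SentenceTransformer

# Assumption: the checkpoint loads as a SentenceTransformer model.
model = SentenceTransformer("codefuse-ai/F2LLM-v2-80M")

query = "What is the limitation period for filing a civil appeal?"

# Default task-type style: one generic prompt for every retrieval task.
generic = model.encode(
    query,
    prompt="Instruct: Given a query, retrieve relevant passages.\nQuery: ",
)

# Task-specific style: an instruction tailored to the dataset (here,
# statute retrieval as in AILAStatutes), which is what the default
# task-type prompts may leave on the table.
tailored = model.encode(
    query,
    prompt="Instruct: Given a legal question, retrieve the statutes that answer it.\nQuery: ",
)
```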
