
Add F2LLM-v2-80M results #503

Open

KennethEnevoldsen wants to merge 4 commits into main from Add-F2LLM-v2-80M

Conversation

@KennethEnevoldsen (Contributor)

related to: embeddings-benchmark/mteb#4381

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation (see the sketch below)
  • My model is available, either as a publicly accessible API or publicly on e.g. Huggingface
  • I solemnly swear that for all results submitted, I have not trained on the evaluation datasets, including their training splits. If I have, I have disclosed it clearly.
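
For context, results like the ones submitted here are typically produced by loading the registered reference implementation through mteb's Python API. A minimal sketch, assuming the model is registered in mteb/models/model_implementations/; the task list is an illustrative subset of the tasks evaluated in this PR:

```python
# Minimal sketch of generating results with the mteb reference
# implementation (assumption: the model is registered under this name).
import mteb

# Resolve the registered implementation by model name.
model = mteb.get_model("codefuse-ai/F2LLM-v2-80M")

# A few of the retrieval tasks from the comparison below.
tasks = mteb.get_tasks(tasks=["AILAStatutes", "LegalQuAD", "AppsRetrieval"])

# Run the evaluation; the result JSON files written to `results/`
# are what gets submitted to this repository.
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```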

github-actions bot commented Apr 30, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: codefuse-ai/F2LLM-v2-80M
Tasks: AILACasedocs, AILAStatutes, AppsRetrieval, CUREv1, ChatDoctorRetrieval, Code1Retrieval, DS1000Retrieval, EnglishFinance1Retrieval, EnglishFinance2Retrieval, EnglishFinance3Retrieval, EnglishFinance4Retrieval, EnglishHealthcare1Retrieval, FinQARetrieval, FinanceBenchRetrieval, French1Retrieval, FrenchLegal1Retrieval, FreshStackRetrieval, German1Retrieval, GermanHealthcare1Retrieval, GermanLegal1Retrieval, HC3FinanceRetrieval, HumanEvalRetrieval, JapaneseCode1Retrieval, JapaneseLegal1Retrieval, LegalQuAD, LegalSummarization, MBPPRetrieval, WikiSQLRetrieval

Results for codefuse-ai/F2LLM-v2-80M

| task_name | codefuse-ai/F2LLM-v2-80M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| AILACasedocs | nan | 0.4833 | 0.2643 | 0.6560 | Octen/Octen-Embedding-8B-INT8 | False |
| AILAStatutes | 0.1877 | 0.4877 | 0.2084 | 0.9451 | Octen/Octen-Embedding-8B-INT8 | False |
| AppsRetrieval | 0.522 | 0.9375 | 0.3255 | 0.9862 | google/gemini-embedding-2-preview | False |
| CUREv1 | 0.3349 | 0.5957 | 0.5162 | 0.6782 | voyageai/voyage-4-large (embed_dim=2048) | False |
| ChatDoctorRetrieval | nan | 0.7352 | 0.5687 | 0.7722 | voyageai/voyage-4-large (embed_dim=2048) | False |
| Code1Retrieval | nan | 0.9474 | nan | 0.9474 | google/gemini-embedding-001 | False |
| DS1000Retrieval | nan | 0.6870 | nan | 0.7149 | google/gemini-embedding-2-preview | False |
| EnglishFinance1Retrieval | nan | 0.7332 | nan | 0.8428 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishFinance2Retrieval | nan | 0.6740 | nan | 0.9137 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishFinance3Retrieval | nan | 0.8330 | nan | 0.8509 | nvidia/NV-Embed-v2 | False |
| EnglishFinance4Retrieval | nan | 0.5757 | nan | 0.6241 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishHealthcare1Retrieval | nan | 0.6338 | nan | 0.6875 | mteb/baseline-bm25s | False |
| FinQARetrieval | nan | 0.6464 | nan | 0.8897 | voyageai/voyage-4-large (embed_dim=2048) | False |
| FinanceBenchRetrieval | nan | 0.9157 | nan | 0.9459 | Octen/Octen-Embedding-8B | False |
| French1Retrieval | nan | 0.8781 | nan | 0.8884 | Cohere/Cohere-embed-v4.0 | False |
| FrenchLegal1Retrieval | nan | 0.8696 | nan | 0.9490 | mteb/baseline-bm25s | False |
| FreshStackRetrieval | nan | 0.3979 | 0.2519 | 0.5776 | Octen/Octen-Embedding-8B | False |
| German1Retrieval | nan | 0.9761 | nan | 0.9797 | voyageai/voyage-4-large (embed_dim=2048) | False |
| GermanHealthcare1Retrieval | nan | 0.8742 | nan | 0.9140 | voyageai/voyage-4-large | False |
| GermanLegal1Retrieval | nan | 0.7149 | nan | 0.7582 | voyageai/voyage-4-large (embed_dim=2048) | False |
| HC3FinanceRetrieval | nan | 0.7758 | nan | 0.8242 | nvidia/NV-Embed-v2 | False |
| HumanEvalRetrieval | nan | 0.9910 | nan | 1.0000 | google/gemini-embedding-2-preview | False |
| JapaneseCode1Retrieval | nan | 0.8650 | nan | 0.8650 | google/gemini-embedding-001 | False |
| JapaneseLegal1Retrieval | nan | 0.9228 | nan | 0.9228 | google/gemini-embedding-001 | False |
| LegalQuAD | 0.2417 | 0.6553 | 0.4317 | 0.7675 | mteb/baseline-bm25s | False |
| LegalSummarization | nan | 0.7122 | 0.621 | 0.7921 | voyageai/voyage-3.5 | False |
| MBPPRetrieval | nan | 0.9416 | nan | 0.9608 | voyageai/voyage-4-large (embed_dim=2048) | False |
| WikiSQLRetrieval | nan | 0.8814 | nan | 0.9892 | Octen/Octen-Embedding-8B | False |
| Average | 0.3216 | 0.7622 | 0.3985 | 0.8444 | nan | - |

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, 
SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA



Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

@Geralt-Targaryen (Contributor)

Hi, thanks for the addition. Were these results evaluated using the default task-type prompts? That may be sub-optimal for many tasks. I'm also evaluating the models on more tasks and plan to submit a PR to both the model implementation and the results in a few days.
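
For readers unfamiliar with the prompt concern above: when a model does not define task-specific prompts, evaluation falls back to a generic prompt shared by a whole task type (e.g., one query instruction for every retrieval task). A hedged sketch of the difference, using the plain sentence-transformers API; the prompt strings are illustrative, not the ones F2LLM-v2 was trained with, and whether this checkpoint loads via SentenceTransformer is an assumption:

```python
from sentence_transformers import SentenceTransformer

# Assumption: the checkpoint loads as a SentenceTransformer model.
model = SentenceTransformer("codefuse-ai/F2LLM-v2-80M")

query = "What is the limitation period for filing a civil appeal?"

# Default task-type style: one generic prompt for every retrieval task.
generic = model.encode(
    query,
    prompt="Instruct: Given a query, retrieve relevant passages.\nQuery: ",
)

# Task-specific style: an instruction tailored to the dataset (here,
# statute retrieval as in AILAStatutes), which is what the default
# task-type prompts may leave on the table.
tailored = model.encode(
    query,
    prompt="Instruct: Given a legal question, retrieve the statutes that answer it.\nQuery: ",
)
```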
