Add F2LLM-v2-80M results #503
Conversation
Model Results Comparison

Reference models: results for codefuse-ai/F2LLM-v2-80M are compared against google/gemini-embedding-001 and intfloat/multilingual-e5-large below.
| task_name | codefuse-ai/F2LLM-v2-80M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| AILACasedocs | nan | 0.4833 | 0.2643 | 0.6560 | Octen/Octen-Embedding-8B-INT8 | False |
| AILAStatutes | 0.1877 | 0.4877 | 0.2084 | 0.9451 | Octen/Octen-Embedding-8B-INT8 | False |
| AppsRetrieval | 0.522 | 0.9375 | 0.3255 | 0.9862 | google/gemini-embedding-2-preview | False |
| CUREv1 | 0.3349 | 0.5957 | 0.5162 | 0.6782 | voyageai/voyage-4-large (embed_dim=2048) | False |
| ChatDoctorRetrieval | nan | 0.7352 | 0.5687 | 0.7722 | voyageai/voyage-4-large (embed_dim=2048) | False |
| Code1Retrieval | nan | 0.9474 | nan | 0.9474 | google/gemini-embedding-001 | False |
| DS1000Retrieval | nan | 0.6870 | nan | 0.7149 | google/gemini-embedding-2-preview | False |
| EnglishFinance1Retrieval | nan | 0.7332 | nan | 0.8428 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishFinance2Retrieval | nan | 0.6740 | nan | 0.9137 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishFinance3Retrieval | nan | 0.8330 | nan | 0.8509 | nvidia/NV-Embed-v2 | False |
| EnglishFinance4Retrieval | nan | 0.5757 | nan | 0.6241 | voyageai/voyage-4-large (embed_dim=2048) | False |
| EnglishHealthcare1Retrieval | nan | 0.6338 | nan | 0.6875 | mteb/baseline-bm25s | False |
| FinQARetrieval | nan | 0.6464 | nan | 0.8897 | voyageai/voyage-4-large (embed_dim=2048) | False |
| FinanceBenchRetrieval | nan | 0.9157 | nan | 0.9459 | Octen/Octen-Embedding-8B | False |
| French1Retrieval | nan | 0.8781 | nan | 0.8884 | Cohere/Cohere-embed-v4.0 | False |
| FrenchLegal1Retrieval | nan | 0.8696 | nan | 0.9490 | mteb/baseline-bm25s | False |
| FreshStackRetrieval | nan | 0.3979 | 0.2519 | 0.5776 | Octen/Octen-Embedding-8B | False |
| German1Retrieval | nan | 0.9761 | nan | 0.9797 | voyageai/voyage-4-large (embed_dim=2048) | False |
| GermanHealthcare1Retrieval | nan | 0.8742 | nan | 0.9140 | voyageai/voyage-4-large | False |
| GermanLegal1Retrieval | nan | 0.7149 | nan | 0.7582 | voyageai/voyage-4-large (embed_dim=2048) | False |
| HC3FinanceRetrieval | nan | 0.7758 | nan | 0.8242 | nvidia/NV-Embed-v2 | False |
| HumanEvalRetrieval | nan | 0.9910 | nan | 1.0000 | google/gemini-embedding-2-preview | False |
| JapaneseCode1Retrieval | nan | 0.8650 | nan | 0.8650 | google/gemini-embedding-001 | False |
| JapaneseLegal1Retrieval | nan | 0.9228 | nan | 0.9228 | google/gemini-embedding-001 | False |
| LegalQuAD | 0.2417 | 0.6553 | 0.4317 | 0.7675 | mteb/baseline-bm25s | False |
| LegalSummarization | nan | 0.7122 | 0.621 | 0.7921 | voyageai/voyage-3.5 | False |
| MBPPRetrieval | nan | 0.9416 | nan | 0.9608 | voyageai/voyage-4-large (embed_dim=2048) | False |
| WikiSQLRetrieval | nan | 0.8814 | nan | 0.9892 | Octen/Octen-Embedding-8B | False |
| Average | 0.3216 | 0.7622 | 0.3985 | 0.8444 | nan | - |
Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, 
SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA
Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.
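For anyone wanting to reproduce or extend this comparison, per-task scores for published models can also be loaded programmatically. A minimal sketch, assuming the mteb package's `load_results` helper; the `models`/`tasks` filter arguments and the `to_dataframe()` flattening are from recent mteb versions and may differ in older ones:

```python
# Sketch: pull published per-task scores to rebuild a comparison like the
# table above. Assumes mteb's load_results helper; argument names and the
# to_dataframe() flattening may differ across mteb versions.
import mteb

results = mteb.load_results(
    models=["google/gemini-embedding-001", "intfloat/multilingual-e5-large"],
    tasks=["AILAStatutes", "AppsRetrieval", "LegalQuAD"],
)

# Flatten into one row per (model, task) pair for side-by-side comparison.
print(results.to_dataframe())
```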
Hi, thanks for the addition. Were these results evaluated using the default task-type prompts? That may be sub-optimal for many tasks. I'm also evaluating the models on more tasks and plan to submit a PR covering both the model implementation and the results in a few days.
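For context on the prompt question: when a model is loaded as a plain SentenceTransformer, mteb falls back to generic task-type prompts unless the model (or its mteb implementation) declares its own. A minimal sketch of evaluating with explicit query/passage prompts instead, assuming mteb's `MTEB.run` API and sentence-transformers' named-prompt support; the prompt strings and output folder are illustrative only:

```python
# Sketch: run a single task with explicit prompts rather than the defaults.
# Assumes the mteb and sentence-transformers packages; the prompt strings
# below are placeholders, not F2LLM's actual training prompts.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "codefuse-ai/F2LLM-v2-80M",
    prompts={
        # mteb's sentence-transformers wrapper can select named prompts
        # per input role instead of its generic task-type defaults.
        "query": "Instruct: Retrieve relevant documents for the query.\nQuery: ",
        "passage": "",  # encode documents without an instruction
    },
)

tasks = mteb.get_tasks(tasks=["AILAStatutes"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/F2LLM-v2-80M")
```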
related to: embeddings-benchmark/mteb#4381
Checklist
- [ ] The model implementation has been added under mteb/models/model_implementations/ (this can be an API-based implementation). Instructions on how to add a model can be found here. A hedged sketch of such a file follows below.
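For reference, a model implementation in mteb is typically a `ModelMeta` entry whose loader wraps the model, placed in a file under mteb/models/model_implementations/. A hypothetical minimal sketch; the import paths, loader name, and required metadata fields should be checked against the current mteb docs, and the revision and release date below are placeholders:

```python
# Hypothetical sketch of a model implementation file, e.g.
# mteb/models/model_implementations/f2llm_models.py. Import paths and
# ModelMeta fields follow recent mteb versions and may need adjusting.
from functools import partial

from mteb.model_meta import ModelMeta
from mteb.models.sentence_transformer_wrapper import sentence_transformers_loader

f2llm_v2_80m = ModelMeta(
    loader=partial(
        sentence_transformers_loader,
        model_name="codefuse-ai/F2LLM-v2-80M",
        revision=None,  # pin a specific Hugging Face revision before submitting
    ),
    name="codefuse-ai/F2LLM-v2-80M",
    revision=None,  # same pinned revision as above
    release_date="2025-01-01",  # placeholder; use the model's actual release date
    languages=["eng-Latn"],
    open_weights=True,
    framework=["Sentence Transformers", "PyTorch"],
    # ModelMeta defines further fields (n_parameters, embed_dim, license,
    # max_tokens, training_datasets, ...) that the registry may require.
)
```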