add: Elastic KB Results #505

Closed
emilia-elastic wants to merge 1 commit into embeddings-benchmark:main from emilia-elastic:add-results-elastickbretrieval

@emilia-elastic

Complementary to #502 and related to embeddings-benchmark/mteb#4487

  • The submitted results were obtained using the reference implementation.

@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: BAAI/bge-m3, KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5, Qwen/Qwen3-Embedding-0.6B, google/embeddinggemma-300m, infgrad/Jasper-Token-Compression-600M, intfloat/multilingual-e5-base, intfloat/multilingual-e5-small, jinaai/jina-embeddings-v5-text-nano, jinaai/jina-embeddings-v5-text-small, mteb/baseline-bm25s
Tasks: ElasticKBRetrieval

Results for BAAI/bge-m3

| task_name | BAAI/bge-m3 | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.5337 | | | False |
| Average | 0.5337 | nan | - | |

Training datasets: CMedQAv1-reranking, CMedQAv2-reranking, CmedqaRetrieval, CodeSearchNet, DuRetrieval, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, LeCaRDv2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, T2Reranking, T2Retrieval, mMARCO-NL


Results for KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5

| task_name | KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5 | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.5279 | | | False |
| Average | 0.5279 | nan | - | |

Training datasets: ATEC, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonReviewsClassification, AmazonReviewsVNClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, BQ, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CQADupstack, CodeFeedbackMT, CodeFeedbackST, ContractNLIConfidentialityOfAgreementLegalBenchClassification, ContractNLIExplicitIdentificationLegalBenchClassification, ContractNLIInclusionOfVerballyConveyedInformationLegalBenchClassification, ContractNLILimitedUseLegalBenchClassification, ContractNLINoLicensingLegalBenchClassification, ContractNLINoticeOnCompelledDisclosureLegalBenchClassification, ContractNLIPermissibleAcquirementOfSimilarInformationLegalBenchClassification, ContractNLIPermissibleCopyLegalBenchClassification, ContractNLIPermissibleDevelopmentOfSimilarInformationLegalBenchClassification, ContractNLIPermissiblePostAgreementPossessionLegalBenchClassification, ContractNLIReturnOfConfidentialInformationLegalBenchClassification, ContractNLISharingWithEmployeesLegalBenchClassification, ContractNLISharingWithThirdPartiesLegalBenchClassification, ContractNLISurvivalOfObligationsLegalBenchClassification, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, ESCIReranking, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMEToxicConversationsClassification, 
HUMETweetSentimentExtractionClassification, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiLongDocReranking, MultiLongDocRetrieval, MultilingualSentiment, MultilingualSentiment.v2, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoQuoraRetrieval, NanoSciFactRetrieval, PawsXPairClassification, Quora-NL, Quora-PL, Quora-PLHardNegatives, QuoraRetrieval, QuoraRetrieval-Fa, QuoraRetrieval-Fa.v2, QuoraRetrievalHardNegatives, QuoraRetrievalHardNegatives.v2, Reddit-Clustering, Reddit-Clustering-P2P, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, Stackexchange-Clustering, Stackexchange-Clustering-P2P, TRECCOVID, TRECCOVID-Fa, TRECCOVID-Fa.v2, TRECCOVID-NL, TRECCOVID-PL, TRECCOVID-VN, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, 
TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroups-Clustering, YahooAnswersTopicsClassification, YahooAnswersTopicsClassification.v2, mMARCO-NL


Results for Qwen/Qwen3-Embedding-0.6B

| task_name | Qwen/Qwen3-Embedding-0.6B | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.5108 | | | False |
| Average | 0.5108 | nan | - | |

Training datasets: CMedQAv2-reranking, CmedqaRetrieval, CodeSearchNet, DuRetrieval, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, T2Retrieval


Results for google/embeddinggemma-300m

| task_name | google/embeddinggemma-300m | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.6052 | | | False |
| Average | 0.6052 | nan | - | |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoNQ-VN, NanoNQRetrieval


Results for infgrad/Jasper-Token-Compression-600M

| task_name | infgrad/Jasper-Token-Compression-600M | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.5517 | | | False |
| Average | 0.5517 | nan | - | |

Training datasets: AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonReviewsClassification, AmazonReviewsVNClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CMedQAv1-reranking, CMedQAv2-reranking, CmedqaRetrieval, Cmnli, CodeSearchNet, DuRetrieval, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, LCQMC, LeCaRDv2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MLDR, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MTOPIntentClassification, MTOPIntentVNClassification, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MindSmallReranking, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, 
NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, NanoQuoraRetrieval, Ocnli, PAWSX, Quora-NL, Quora-PL, Quora-PLHardNegatives, QuoraRetrieval, QuoraRetrieval-Fa, QuoraRetrieval-Fa.v2, QuoraRetrievalHardNegatives, QuoraRetrievalHardNegatives.v2, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciDocsReranking, StackOverflowDupQuestions, StackOverflowDupQuestions-VN, T2Reranking, T2Retrieval, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, Waimai, Waimai.v2, XQuADRetrieval, mMARCO-NL


Results for intfloat/multilingual-e5-base

| task_name | intfloat/multilingual-e5-base | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.4674 | | | False |
| Average | 0.4674 | nan | - | |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL


Results for intfloat/multilingual-e5-small

| task_name | intfloat/multilingual-e5-small | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.4202 | | | False |
| Average | 0.4202 | nan | - | |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL


Results for jinaai/jina-embeddings-v5-text-nano

| task_name | jinaai/jina-embeddings-v5-text-nano | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.5676 | | | False |
| Average | 0.5676 | nan | - | |

Results for jinaai/jina-embeddings-v5-text-small

| task_name | jinaai/jina-embeddings-v5-text-small | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.5933 | | | False |
| Average | 0.5933 | nan | - | |

Results for mteb/baseline-bm25s

| task_name | mteb/baseline-bm25s | Max result | Model with max result | In Training Data |
|---|---|---|---|---|
| ElasticKBRetrieval | 0.5324 | | | False |
| Average | 0.5324 | nan | - | |


Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.
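
Since the report lists each model in its own table, a side-by-side view may be easier to read. The snippet below is a minimal sketch that copies the ElasticKBRetrieval scores from the tables above into a dictionary and ranks them, also showing each model's difference from the `mteb/baseline-bm25s` entry. The names `scores`, `BASELINE`, and `rank_models` are illustrative and not part of the PR or of mteb.

```python
# Scores copied by hand from the per-model tables in the bot comment above.
scores = {
    "BAAI/bge-m3": 0.5337,
    "KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5": 0.5279,
    "Qwen/Qwen3-Embedding-0.6B": 0.5108,
    "google/embeddinggemma-300m": 0.6052,
    "infgrad/Jasper-Token-Compression-600M": 0.5517,
    "intfloat/multilingual-e5-base": 0.4674,
    "intfloat/multilingual-e5-small": 0.4202,
    "jinaai/jina-embeddings-v5-text-nano": 0.5676,
    "jinaai/jina-embeddings-v5-text-small": 0.5933,
    "mteb/baseline-bm25s": 0.5324,
}

# Hypothetical choice: treat the bm25s entry as the comparison baseline.
BASELINE = "mteb/baseline-bm25s"


def rank_models(model_scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(model_scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_models(scores)
above_baseline = [m for m, s in scores.items() if s > scores[BASELINE]]

for position, (model, score) in enumerate(ranking, start=1):
    delta = score - scores[BASELINE]
    print(f"{position:>2}. {model:<62} {score:.4f} ({delta:+.4f} vs baseline)")
```

With the numbers reported above, `google/embeddinggemma-300m` ranks first and five of the nine embedding models score above the baseline entry.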

@github-actions

This pull request has been automatically marked as stale due to inactivity.

github-actions bot added the stale label on May 15, 2026
@github-actions

This pull request has been automatically closed due to inactivity.

github-actions bot closed this on May 22, 2026