add: Elastic KB Results by emilia-elastic · Pull Request #505 · embeddings-benchmark/results

emilia-elastic · 2026-04-30T23:28:46Z

Complementary to #502 and related to embeddings-benchmark/mteb#4487

The results submitted are obtained using the reference implementation

github-actions · 2026-04-30T23:31:10Z

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: BAAI/bge-m3, KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5, Qwen/Qwen3-Embedding-0.6B, google/embeddinggemma-300m, infgrad/Jasper-Token-Compression-600M, intfloat/multilingual-e5-base, intfloat/multilingual-e5-small, jinaai/jina-embeddings-v5-text-nano, jinaai/jina-embeddings-v5-text-small, mteb/baseline-bm25s
Tasks: ElasticKBRetrieval

Results for `BAAI/bge-m3`

task_name	BAAI/bge-m3	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.5337			False
Average	0.5337		nan	-

Training datasets: CMedQAv1-reranking, CMedQAv2-reranking, CmedqaRetrieval, CodeSearchNet, DuRetrieval, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, LeCaRDv2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, T2Reranking, T2Retrieval, mMARCO-NL

Results for `KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5`

task_name	KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.5279			False
Average	0.5279		nan	-

Training datasets: ATEC, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonReviewsClassification, AmazonReviewsVNClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, BQ, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CQADupstack, CodeFeedbackMT, CodeFeedbackST, ContractNLIConfidentialityOfAgreementLegalBenchClassification, ContractNLIExplicitIdentificationLegalBenchClassification, ContractNLIInclusionOfVerballyConveyedInformationLegalBenchClassification, ContractNLILimitedUseLegalBenchClassification, ContractNLINoLicensingLegalBenchClassification, ContractNLINoticeOnCompelledDisclosureLegalBenchClassification, ContractNLIPermissibleAcquirementOfSimilarInformationLegalBenchClassification, ContractNLIPermissibleCopyLegalBenchClassification, ContractNLIPermissibleDevelopmentOfSimilarInformationLegalBenchClassification, ContractNLIPermissiblePostAgreementPossessionLegalBenchClassification, ContractNLIReturnOfConfidentialInformationLegalBenchClassification, ContractNLISharingWithEmployeesLegalBenchClassification, ContractNLISharingWithThirdPartiesLegalBenchClassification, ContractNLISurvivalOfObligationsLegalBenchClassification, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, ESCIReranking, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiLongDocReranking, MultiLongDocRetrieval, MultilingualSentiment, MultilingualSentiment.v2, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoQuoraRetrieval, NanoSciFactRetrieval, PawsXPairClassification, Quora-NL, Quora-PL, Quora-PLHardNegatives, QuoraRetrieval, QuoraRetrieval-Fa, QuoraRetrieval-Fa.v2, QuoraRetrievalHardNegatives, QuoraRetrievalHardNegatives.v2, Reddit-Clustering, Reddit-Clustering-P2P, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, Stackexchange-Clustering, Stackexchange-Clustering-P2P, TRECCOVID, TRECCOVID-Fa, TRECCOVID-Fa.v2, TRECCOVID-NL, TRECCOVID-PL, TRECCOVID-VN, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroups-Clustering, YahooAnswersTopicsClassification, YahooAnswersTopicsClassification.v2, mMARCO-NL

Results for `Qwen/Qwen3-Embedding-0.6B`

task_name	Qwen/Qwen3-Embedding-0.6B	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.5108			False
Average	0.5108		nan	-

Training datasets: CMedQAv2-reranking, CmedqaRetrieval, CodeSearchNet, DuRetrieval, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, T2Retrieval

Results for `google/embeddinggemma-300m`

task_name	google/embeddinggemma-300m	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.6052			False
Average	0.6052		nan	-

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoNQ-VN, NanoNQRetrieval

Results for `infgrad/Jasper-Token-Compression-600M`

task_name	infgrad/Jasper-Token-Compression-600M	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.5517			False
Average	0.5517		nan	-

Training datasets: AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonReviewsClassification, AmazonReviewsVNClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CMedQAv1-reranking, CMedQAv2-reranking, CmedqaRetrieval, Cmnli, CodeSearchNet, DuRetrieval, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, LCQMC, LeCaRDv2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MLDR, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MTOPIntentClassification, MTOPIntentVNClassification, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MindSmallReranking, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, NanoQuoraRetrieval, Ocnli, PAWSX, Quora-NL, Quora-PL, Quora-PLHardNegatives, QuoraRetrieval, QuoraRetrieval-Fa, QuoraRetrieval-Fa.v2, QuoraRetrievalHardNegatives, QuoraRetrievalHardNegatives.v2, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciDocsReranking, StackOverflowDupQuestions, StackOverflowDupQuestions-VN, T2Reranking, T2Retrieval, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, Waimai, Waimai.v2, XQuADRetrieval, mMARCO-NL

Results for `intfloat/multilingual-e5-base`

task_name	intfloat/multilingual-e5-base	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.4674			False
Average	0.4674		nan	-

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL

Results for `intfloat/multilingual-e5-small`

task_name	intfloat/multilingual-e5-small	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.4202			False
Average	0.4202		nan	-

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL

Results for `jinaai/jina-embeddings-v5-text-nano`

task_name	jinaai/jina-embeddings-v5-text-nano	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.5676			False
Average	0.5676		nan	-

Results for `jinaai/jina-embeddings-v5-text-small`

task_name	jinaai/jina-embeddings-v5-text-small	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.5933			False
Average	0.5933		nan	-

Results for `mteb/baseline-bm25s`

task_name	mteb/baseline-bm25s	Max result	Model with max result	In Training Data
ElasticKBRetrieval	0.5324			False
Average	0.5324		nan	-

Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

github-actions · 2026-05-15T02:23:09Z

This pull request has been automatically marked as stale due to inactivity.

github-actions · 2026-05-22T02:24:53Z

This pull request has been automatically closed due to inactivity.

add: Elastic KB Results

89985e2

github-actions Bot added the stale label May 15, 2026

github-actions Bot closed this May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add: Elastic KB Results#505

add: Elastic KB Results#505
emilia-elastic wants to merge 1 commit into
embeddings-benchmark:mainfrom
emilia-elastic:add-results-elastickbretrieval

emilia-elastic commented Apr 30, 2026

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

emilia-elastic commented Apr 30, 2026

Uh oh!

github-actions Bot commented Apr 30, 2026

Model Results Comparison

Results for BAAI/bge-m3

Results for KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5

Results for Qwen/Qwen3-Embedding-0.6B

Results for google/embeddinggemma-300m

Results for infgrad/Jasper-Token-Compression-600M

Results for intfloat/multilingual-e5-base

Results for intfloat/multilingual-e5-small

Results for jinaai/jina-embeddings-v5-text-nano

Results for jinaai/jina-embeddings-v5-text-small

Results for mteb/baseline-bm25s

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Results for `BAAI/bge-m3`

Results for `KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5`

Results for `Qwen/Qwen3-Embedding-0.6B`

Results for `google/embeddinggemma-300m`

Results for `infgrad/Jasper-Token-Compression-600M`

Results for `intfloat/multilingual-e5-base`

Results for `intfloat/multilingual-e5-small`

Results for `jinaai/jina-embeddings-v5-text-nano`

Results for `jinaai/jina-embeddings-v5-text-small`

Results for `mteb/baseline-bm25s`