Most ecommerce_hybrid_search documents have empty or blank descriptions?

After running the data preparation scripts on the first 10 categories, I got a dataset where 98.7% of documents have an empty or blank description. (I trimmed to the first 10 categories, as suggested by the script, because including all categories made the script fail for lack of RAM when concatenating all parquet files on my machine with 32GB of RAM.)
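For anyone who wants to check this on their own prepared data, here is a minimal sketch of how such a ratio can be computed. It assumes documents are available as dictionaries with a `description` key; the function name, the field name, and the sample records are hypothetical and not taken from the preparation scripts.

```python
def blank_description_ratio(docs, field="description"):
    """Return the fraction of docs whose `field` is missing, None, or whitespace-only."""
    total = 0
    blank = 0
    for doc in docs:
        total += 1
        value = doc.get(field)
        # Treat a missing key, None, "" and whitespace-only strings as blank.
        if value is None or not str(value).strip():
            blank += 1
    return blank / total if total else 0.0


# Hypothetical sample records, for illustration only.
sample = [
    {"title": "Brake pads", "description": ""},
    {"title": "Car wax", "description": "   "},
    {"title": "Floor mats", "description": "Fits most cars"},
    {"title": "Wiper blades"},
]
ratio = blank_description_ratio(sample)
print(f"{ratio:.1%} of documents have a blank description")
```

The function accepts any iterable, so records can be streamed one file at a time instead of concatenating everything in memory first.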

This makes the benchmark less realistic and less interesting in my mind, because titles alone tend to contain rather specific terms that few queries match, while descriptions tend to contain more common terms that are more likely to appear in queries. For instance, a query on "car" retrieves few documents despite the presence of an "Automotive" category.

I also wonder whether this makes the comparison with Elasticsearch less fair: Lucene's BM25 uses the number of docs with at least one term in the field as the corpus size, while Vespa's BM25 seems to use the total number of docs in the index, so are the scores produced not comparable?
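To illustrate why the choice of corpus size would matter here, a rough sketch of the IDF component alone, using the formula `ln(1 + (N - n + 0.5) / (n + 0.5))`. The numbers are made up (only the 1.3% share of non-blank descriptions follows from the figure above), and the assumption that one engine uses the per-field doc count while the other uses the total doc count is the premise of my question, not something I have verified.

```python
import math


def bm25_idf(doc_freq, corpus_size):
    """IDF term of BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (corpus_size - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical numbers: 100,000 docs, of which 1.3% have a non-blank description.
total_docs = 100_000
docs_with_description = 1_300
doc_freq = 100  # docs whose description contains the query term

# N = docs that have at least one term in the field (assumed per-field counting).
idf_field = bm25_idf(doc_freq, docs_with_description)
# N = all docs in the index (assumed total counting).
idf_total = bm25_idf(doc_freq, total_docs)

print(f"IDF with per-field corpus size: {idf_field:.2f}")
print(f"IDF with total corpus size:     {idf_total:.2f}")
```

With a mostly empty field, the same term gets a much higher IDF under the total-count assumption, which changes how much the description match weighs relative to the title match.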

Most ecommerce_hybrid_search documents have empty or blank descriptions? #4474
