Description
After running the data preparation scripts on the first 10 categories, I got a dataset where 98.7% of documents have an empty or blank description. (I trimmed to the first 10 categories, as the script suggests, because including all categories failed on my machine with 32GB of RAM: it ran out of memory when concatenating all the parquet files.)
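For anyone who wants to check this share on their own output, here is a minimal sketch of how it can be measured. It assumes the documents are available as dictionaries with a `description` key; that field name, the `blank_fraction` helper, and the sample records are made up for illustration and are not taken from the preparation scripts.

```python
def blank_fraction(docs, field="description"):
    """Return the share of documents whose `field` is missing, empty, or whitespace-only."""
    docs = list(docs)
    if not docs:
        return 0.0
    # A value counts as blank if it is None, "", or only whitespace.
    blank = sum(1 for d in docs if not str(d.get(field) or "").strip())
    return blank / len(docs)


# Hypothetical sample records, not real benchmark data.
sample = [
    {"title": "Brake pads", "description": ""},
    {"title": "Oil filter", "description": "   "},
    {"title": "Floor mats", "description": None},
    {"title": "Car cover", "description": "Waterproof cover for sedans"},
]

print(f"{blank_fraction(sample):.1%}")  # prints 75.0%
```

Reading the rows from the generated parquet files would need a reader such as pandas or pyarrow, which is left out here to keep the sketch dependency-free.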
This makes the benchmark less realistic and less interesting in my mind, because titles alone tend to contain rather specific terms that are matched by few queries, while descriptions tend to contain more common terms that are more likely to appear in queries. For instance, a query on "car" retrieves few documents despite the presence of an "Automotive" category.
I also wonder whether this makes the comparison with Elasticsearch less fair. Lucene's BM25 uses the number of docs that have at least one term in the field as the corpus size, while Vespa's BM25 seems to use the total number of docs in the index, so the scores the two engines produce may not be comparable.
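To illustrate why the corpus size matters, here is a rough sketch that plugs two different corpus sizes into the same IDF formula. The formula is the one Lucene's `BM25Similarity` uses, `ln(1 + (N - n + 0.5) / (n + 0.5))`; applying it with the total doc count is only my assumption about how the other engine behaves, and all the numbers below are made up for illustration rather than measured on this dataset.

```python
import math


def bm25_idf(corpus_size, doc_freq):
    """BM25 IDF in the form used by Lucene: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (corpus_size - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical numbers: 1M docs in the index, of which only 1.3% have a
# non-empty description, and a term that occurs in 100 descriptions.
total_docs = 1_000_000
docs_with_field = 13_000
doc_freq = 100

# N = docs that have at least one term in the field.
idf_field_count = bm25_idf(docs_with_field, doc_freq)
# N = all docs in the index (assumed behavior of the other engine).
idf_total_count = bm25_idf(total_docs, doc_freq)

print(f"IDF with N = docs with field: {idf_field_count:.3f}")
print(f"IDF with N = total docs:      {idf_total_count:.3f}")
```

With the same document frequency, the larger corpus size gives the larger IDF, so the gap between the two conventions grows as more documents have an empty field.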