Add a fast-field variant of TermSet #2718
… with a fast field column.
I think a precomputed
Hm, interesting. That could be implemented using either the posting lists or fast fields, I assume? So it seems like it's independent of which index we use. For very high term counts, I think that you'll still want to scan the fast field, since it is so dense? Requires benchmarking probably.
* Removes allocation in a bunch of places
* Removes sorting of terms if we're going to use the fast field execution method
* Adds back the (accidentally dropped) cardinality threshold
* Removes `bool` support -- using the posting lists is always more efficient for a `bool`, since there are at most two of them
* More eagerly constructs the term `HashSet` so that it happens once, rather than once per segment
The BitSet would be precomputed on the inverted index. I think we would need a benchmark to justify the additional code. (also good to know :)
FWIW: The benchmarks that I did in ParadeDB for this PR are over here: paradedb/paradedb#3412 ... for large term sets, consuming the posting lists uses a very large amount of memory and does a lot of seeking: this implementation is about 7 times faster in one worker process. You can ignore the "multiple workers as pessimization" aspect: that's specific to having a huge query in ParadeDB.
I mean comparing the BitSet variant (via
The `TermSetQuery` currently produces one `Scorer`/`DocSet` per matched term by scanning the term dictionary and then consuming posting lists. For very large sets of terms and a fast field, it is faster to scan the fast field column while intersecting with a `HashSet` of (encoded) term values.

Following the pattern set by the two execution modes of `RangeQuery`, this PR introduces a variant of `TermSet` which uses fast fields, and then uses it when there are more than 1024 input terms (an arbitrary threshold!).

Performance is significantly improved for large `TermSet`s of primitives.
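
To make the two execution modes concrete, here is a minimal, standard-library-only sketch. It is not the code in this PR and does not use the real index types: `Segment`, `match_via_postings`, `match_via_fast_field`, `matching_docs`, and `FAST_FIELD_THRESHOLD` are hypothetical names, the column is modelled as a plain `Vec<u64>` with one value per document, and the inverted index is modelled as a `BTreeMap`. The only details taken from the description above are the `HashSet` probe during a column scan and the "more than 1024 terms" switch.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical constant mirroring the "more than 1024 input terms" rule.
const FAST_FIELD_THRESHOLD: usize = 1024;

/// Toy segment (an assumption for illustration): a dense column holding one
/// encoded value per doc id, plus an inverted index built from the same data.
struct Segment {
    column: Vec<u64>,
    postings: BTreeMap<u64, Vec<u32>>,
}

impl Segment {
    fn new(column: Vec<u64>) -> Self {
        let mut postings: BTreeMap<u64, Vec<u32>> = BTreeMap::new();
        for (doc, &value) in column.iter().enumerate() {
            postings.entry(value).or_default().push(doc as u32);
        }
        Segment { column, postings }
    }

    /// Posting-list mode: one lookup per term, then merge the per-term doc lists.
    fn match_via_postings(&self, terms: &HashSet<u64>) -> Vec<u32> {
        let mut docs: Vec<u32> = terms
            .iter()
            .filter_map(|term| self.postings.get(term))
            .flatten()
            .copied()
            .collect();
        docs.sort_unstable();
        docs.dedup();
        docs
    }

    /// Fast-field mode: a single linear scan of the column, probing the set per doc.
    fn match_via_fast_field(&self, terms: &HashSet<u64>) -> Vec<u32> {
        self.column
            .iter()
            .enumerate()
            .filter(|(_, value)| terms.contains(*value))
            .map(|(doc, _)| doc as u32)
            .collect()
    }

    /// Pick the mode from the number of input terms.
    fn matching_docs(&self, terms: &HashSet<u64>) -> Vec<u32> {
        if terms.len() > FAST_FIELD_THRESHOLD {
            self.match_via_fast_field(terms)
        } else {
            self.match_via_postings(terms)
        }
    }
}

fn main() {
    let segment = Segment::new((0..10_000u64).map(|doc| doc % 5_000).collect());
    // The term set is built once and could be shared by every segment.
    let terms: HashSet<u64> = (0..2_000u64).collect();
    let docs = segment.matching_docs(&terms);
    // Both modes must agree on the matching doc ids.
    assert_eq!(docs, segment.match_via_postings(&terms));
    println!("{} matching docs", docs.len());
}
```

In this toy model the posting-list path does work proportional to the number of terms plus a sort of the merged doc ids, while the column scan does one hash probe per document regardless of how many terms are in the set. That is the trade-off the threshold is meant to capture; where the real crossover sits would still need benchmarking.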