
Conversation

stuhood (Collaborator) commented Oct 15, 2025

The TermSet query currently produces one Scorer/DocSet per matched term by scanning the term dictionary and then consuming posting lists. For very large term sets on a field that also has a fast field, it is faster to scan the fast field column once while intersecting with a HashSet of (encoded) term values.
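The fast-field strategy can be sketched in a few lines. This is a simplified stand-in, not tantivy's actual API: `collect_matches` and the plain `&[u64]` column are hypothetical; the real implementation reads tantivy's columnar fast field readers.

```rust
use std::collections::HashSet;

/// Hypothetical sketch: instead of opening one posting list per term,
/// scan the fast field column once and test each per-document value
/// against a HashSet of encoded term values.
fn collect_matches(column: &[u64], terms: &HashSet<u64>) -> Vec<u32> {
    column
        .iter()
        .enumerate()
        .filter(|&(_, value)| terms.contains(value))
        .map(|(doc, _)| doc as u32)
        .collect()
}

fn main() {
    let terms: HashSet<u64> = [10, 30].into_iter().collect();
    // Docs 0, 2, and 3 hold values in the term set.
    assert_eq!(collect_matches(&[10, 20, 30, 10], &terms), vec![0, 2, 3]);
}
```

One linear pass over the column replaces per-term dictionary lookups and posting-list seeks, which is why the win grows with the number of terms.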

Following the pattern set by the two execution modes of RangeQuery, this PR introduces a variant of TermSet that uses fast fields, and selects it when there are more than 1024 input terms (an arbitrary threshold!).
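The mode switch described above amounts to a simple dispatch. `ExecutionMode`, `choose_mode`, and `has_fast_field` are hypothetical names for illustration, not tantivy's API; only the 1024 cutoff comes from the PR description.

```rust
/// Mirrors the (arbitrary) threshold from the PR description.
const FAST_FIELD_TERM_THRESHOLD: usize = 1024;

#[derive(Debug, PartialEq)]
enum ExecutionMode {
    /// One Scorer/DocSet per matched term, via the term dictionary.
    PostingLists,
    /// One scan over the fast field column, intersecting with a HashSet.
    FastField,
}

fn choose_mode(num_terms: usize, has_fast_field: bool) -> ExecutionMode {
    // Small term sets stay on posting lists; the fast field scan only
    // pays off once the term set is large, and requires a fast field.
    if has_fast_field && num_terms > FAST_FIELD_TERM_THRESHOLD {
        ExecutionMode::FastField
    } else {
        ExecutionMode::PostingLists
    }
}

fn main() {
    assert_eq!(choose_mode(3, true), ExecutionMode::PostingLists);
    assert_eq!(choose_mode(5000, true), ExecutionMode::FastField);
    assert_eq!(choose_mode(5000, false), ExecutionMode::PostingLists);
}
```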

Performance is significantly improved for large TermSets of primitives.

PSeitz (Collaborator) commented Oct 16, 2025

I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight

stuhood (Collaborator, Author) commented Oct 16, 2025

> I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight

Hm, interesting.

That could be implemented using either the posting lists or fast fields, I assume? So it seems like it's independent of which index we use.

For very high term counts, I think that you'll still want to scan the fast field, since it is so dense? Requires benchmarking probably.

stuhood and others added 3 commits October 16, 2025 12:36
* Removes allocation in a bunch of places
* Removes sorting of terms if we're going to use the fast field execution method
* Adds back the (accidentally dropped) cardinality threshold
* Removes `bool` support -- using the posting lists is always more efficient for a `bool`, since there are at most two of them
* More eagerly constructs the term `HashSet` so that it happens once, rather than once per segment
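The last bullet can be sketched as: build the term `HashSet` once at query construction and hand each segment a cheap `Arc` clone. `TermSetQuery` and `for_segment` here are simplified stand-ins for illustration, not tantivy's actual types.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical simplified query: the term set is built once, up front.
struct TermSetQuery {
    terms: Arc<HashSet<u64>>,
}

impl TermSetQuery {
    fn new(terms: impl IntoIterator<Item = u64>) -> Self {
        // The HashSet is constructed exactly once, at query build time.
        Self {
            terms: Arc::new(terms.into_iter().collect()),
        }
    }

    /// Each segment receives a clone of the Arc (a pointer bump),
    /// not a freshly rebuilt HashSet.
    fn for_segment(&self) -> Arc<HashSet<u64>> {
        Arc::clone(&self.terms)
    }
}

fn main() {
    let query = TermSetQuery::new([1, 2, 3]);
    let seg_a = query.for_segment();
    let seg_b = query.for_segment();
    // Both segments share the same underlying set.
    assert!(Arc::ptr_eq(&seg_a, &seg_b));
    assert!(seg_a.contains(&2));
}
```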
PSeitz (Collaborator) commented Oct 26, 2025

> > I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight
>
> Hm, interesting.
>
> That could be implemented using either the posting lists or fast fields, I assume? So it seems like it's independent of which index we use.
>
> For very high term counts, I think that you'll still want to scan the fast field, since it is so dense? Requires benchmarking probably.

The BitSet would be precomputed on the inverted index. I think we would need a benchmark to justify the additional code. (also good to know :)

stuhood (Collaborator, Author) commented Oct 26, 2025

> I think we would need a benchmark to justify the additional code.

FWIW: The benchmarks that I did in ParadeDB for this PR are over here: paradedb/paradedb#3412 ... for large term sets, consuming the posting lists uses a very large amount of memory, and does a lot of seeking: this implementation is about 7 times faster in one worker process. Can ignore the "multiple workers as pessimization" aspect: that's specific to having a huge query in ParadeDB.

PSeitz (Collaborator) commented Oct 27, 2025

> > I think we would need a benchmark to justify the additional code.
>
> FWIW: The benchmarks that I did in ParadeDB for this PR are over here: paradedb/paradedb#3412 ... for large term sets, consuming the posting lists uses a very large amount of memory, and does a lot of seeking: this implementation is about 7 times faster in one worker process. Can ignore the "multiple workers as pessimization" aspect: that's specific to having a huge query in ParadeDB.

I mean comparing the BitSet variant (via BitSetDocSet) with the fast field variant.
In the BitSet variant we would fill the BitSet term by term from the posting lists, which would fix the memory usage at num_docs/8 bytes and remove all seeking.
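A minimal sketch of that BitSet variant, using a plain `Vec<u64>` of bit words in place of tantivy's BitSetDocSet, and `Vec<u32>` doc-id lists as stand-ins for posting lists (both hypothetical): each term's postings set bits in one shared bitset of roughly num_docs/8 bytes, which can then be read in doc order with no seeking.

```rust
/// Drain each term's posting list into one shared bitset.
/// Memory is fixed at ~num_docs / 8 bytes, regardless of term count.
fn fill_bitset(num_docs: u32, postings: &[Vec<u32>]) -> Vec<u64> {
    let mut words = vec![0u64; (num_docs as usize + 63) / 64];
    for term_postings in postings {
        for &doc in term_postings {
            words[(doc / 64) as usize] |= 1u64 << (doc % 64);
        }
    }
    words
}

/// Membership test: the bitset is then consumed in doc order,
/// so no posting-list seeking is needed at scoring time.
fn contains(words: &[u64], doc: u32) -> bool {
    words[(doc / 64) as usize] & (1u64 << (doc % 64)) != 0
}

fn main() {
    // Two terms whose postings overlap on doc 3.
    let words = fill_bitset(10, &[vec![1, 3], vec![3, 7]]);
    assert!(contains(&words, 1) && contains(&words, 3) && contains(&words, 7));
    assert!(!contains(&words, 0));
}
```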
