
Conversation

stuhood (Collaborator) commented Oct 15, 2025

The TermSet query currently produces one Scorer/DocSet per matched term by scanning the term dictionary and then consuming posting lists. For very large term sets on a field that also has a fast field, it is faster to scan the fast field column once while intersecting with a HashSet of (encoded) term values.
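The fast-field strategy can be sketched in a few lines. This is a simplified stand-in, not tantivy's actual API: `collect_matches` and the plain `&[u64]` column are hypothetical; the real implementation reads tantivy's columnar fast field readers.

```rust
use std::collections::HashSet;

/// Hypothetical sketch: instead of opening one posting list per term,
/// scan the fast field column once and test each per-document value
/// against a HashSet of encoded term values.
fn collect_matches(column: &[u64], terms: &HashSet<u64>) -> Vec<u32> {
    column
        .iter()
        .enumerate()
        .filter(|&(_, value)| terms.contains(value))
        .map(|(doc, _)| doc as u32)
        .collect()
}

fn main() {
    let terms: HashSet<u64> = [10, 30].into_iter().collect();
    // Docs 0, 2, and 3 hold values in the term set.
    assert_eq!(collect_matches(&[10, 20, 30, 10], &terms), vec![0, 2, 3]);
}
```

One linear pass over the column replaces per-term dictionary lookups and posting-list seeks, which is why the win grows with the number of terms.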

Following the pattern set by the two execution modes of RangeQuery, this PR introduces a variant of TermSet that uses fast fields, and selects it when there are more than 1024 input terms (an arbitrary threshold!).
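The mode switch described above amounts to a simple dispatch. `ExecutionMode`, `choose_mode`, and `has_fast_field` are hypothetical names for illustration, not tantivy's API; only the 1024 cutoff comes from the PR description.

```rust
/// Mirrors the (arbitrary) threshold from the PR description.
const FAST_FIELD_TERM_THRESHOLD: usize = 1024;

#[derive(Debug, PartialEq)]
enum ExecutionMode {
    /// One Scorer/DocSet per matched term, via the term dictionary.
    PostingLists,
    /// One scan over the fast field column, intersecting with a HashSet.
    FastField,
}

fn choose_mode(num_terms: usize, has_fast_field: bool) -> ExecutionMode {
    // Small term sets stay on posting lists; the fast field scan only
    // pays off once the term set is large, and requires a fast field.
    if has_fast_field && num_terms > FAST_FIELD_TERM_THRESHOLD {
        ExecutionMode::FastField
    } else {
        ExecutionMode::PostingLists
    }
}

fn main() {
    assert_eq!(choose_mode(3, true), ExecutionMode::PostingLists);
    assert_eq!(choose_mode(5000, true), ExecutionMode::FastField);
    assert_eq!(choose_mode(5000, false), ExecutionMode::PostingLists);
}
```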

Performance is significantly improved for large TermSets of primitives.

PSeitz (Collaborator) commented Oct 16, 2025

I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight

stuhood (Collaborator, Author) commented Oct 16, 2025

> I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight

Hm, interesting.

That could be implemented using either the posting lists or fast fields, I assume? So it seems like it's independent of which index we use.

For very high term counts, I think that you'll still want to scan the fast field, since it is so dense? Requires benchmarking probably.

stuhood and others added 3 commits October 16, 2025 12:36
* Removes allocation in a bunch of places
* Removes sorting of terms if we're going to use the fast field execution method
* Adds back the (accidentally dropped) cardinality threshold
* Removes `bool` support -- using the posting lists is always more efficient for a `bool`, since there are at most two of them
* More eagerly constructs the term `HashSet` so that it happens once, rather than once per segment
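The last bullet can be sketched as: build the term `HashSet` once at query construction and hand each segment a cheap `Arc` clone. `TermSetQuery` and `for_segment` here are simplified stand-ins for illustration, not tantivy's actual types.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical simplified query: the term set is built once, up front.
struct TermSetQuery {
    terms: Arc<HashSet<u64>>,
}

impl TermSetQuery {
    fn new(terms: impl IntoIterator<Item = u64>) -> Self {
        // The HashSet is constructed exactly once, at query build time.
        Self {
            terms: Arc::new(terms.into_iter().collect()),
        }
    }

    /// Each segment receives a clone of the Arc (a pointer bump),
    /// not a freshly rebuilt HashSet.
    fn for_segment(&self) -> Arc<HashSet<u64>> {
        Arc::clone(&self.terms)
    }
}

fn main() {
    let query = TermSetQuery::new([1, 2, 3]);
    let seg_a = query.for_segment();
    let seg_b = query.for_segment();
    // Both segments share the same underlying set.
    assert!(Arc::ptr_eq(&seg_a, &seg_b));
    assert!(seg_a.contains(&2));
}
```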
PSeitz (Collaborator) commented Oct 26, 2025

> > I think a precomputed BitSet should also work well, similar to what I did in regex_phrase_weight
>
> Hm, interesting.
>
> That could be implemented using either the posting lists or fast fields, I assume? So it seems like it's independent of which index we use.
>
> For very high term counts, I think that you'll still want to scan the fast field, since it is so dense? Requires benchmarking probably.

The BitSet would be precomputed on the inverted index. I think we would need a benchmark to justify the additional code. (also good to know :)

stuhood (Collaborator, Author) commented Oct 26, 2025

> I think we would need a benchmark to justify the additional code.

FWIW: The benchmarks that I did in ParadeDB for this PR are over here: paradedb/paradedb#3412 ... for large term sets, consuming the posting lists uses a very large amount of memory, and does a lot of seeking: this implementation is about 7 times faster in one worker process. Can ignore the "multiple workers as pessimization" aspect: that's specific to having a huge query in ParadeDB.

PSeitz (Collaborator) commented Oct 27, 2025

> > I think we would need a benchmark to justify the additional code.
>
> FWIW: The benchmarks that I did in ParadeDB for this PR are over here: paradedb/paradedb#3412 ... for large term sets, consuming the posting lists uses a very large amount of memory, and does a lot of seeking: this implementation is about 7 times faster in one worker process. Can ignore the "multiple workers as pessimization" aspect: that's specific to having a huge query in ParadeDB.

I mean comparing the BitSet variant (via BitSetDocSet) with the fast field variant.
In the BitSet variant we would fill the BitSet term by term from the posting lists, which would fix the memory usage at num_docs/8 bytes and remove all seeking.
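A minimal sketch of that BitSet variant, using a plain `Vec<u64>` of bit words in place of tantivy's BitSetDocSet, and `Vec<u32>` doc-id lists as stand-ins for posting lists (both hypothetical): each term's postings set bits in one shared bitset of roughly num_docs/8 bytes, which can then be read in doc order with no seeking.

```rust
/// Drain each term's posting list into one shared bitset.
/// Memory is fixed at ~num_docs / 8 bytes, regardless of term count.
fn fill_bitset(num_docs: u32, postings: &[Vec<u32>]) -> Vec<u64> {
    let mut words = vec![0u64; (num_docs as usize + 63) / 64];
    for term_postings in postings {
        for &doc in term_postings {
            words[(doc / 64) as usize] |= 1u64 << (doc % 64);
        }
    }
    words
}

/// Membership test: the bitset is then consumed in doc order,
/// so no posting-list seeking is needed at scoring time.
fn contains(words: &[u64], doc: u32) -> bool {
    words[(doc / 64) as usize] & (1u64 << (doc % 64)) != 0
}

fn main() {
    // Two terms whose postings overlap on doc 3.
    let words = fill_bitset(10, &[vec![1, 3], vec![3, 7]]);
    assert!(contains(&words, 1) && contains(&words, 3) && contains(&words, 7));
    assert!(!contains(&words, 0));
}
```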
