-
Notifications
You must be signed in to change notification settings - Fork 123
feat: stats pruning for LIKE filters #6049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Merging this PR will degrade performance by 40.14%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | WallTime | u16_FoR[10M] |
6.5 µs | 10.8 µs | -40.14% |
Comparing aduffy/like-pushdown (e35f51d) with develop (a095ca7)
Footnotes
-
1290 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
8f4900b to
51f3fa7
Compare
51f3fa7 to
194c964
Compare
Codecov Report❌ Patch coverage is
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
e631ebd to
fbff6c0
Compare
|
FWIW the stats truncation logic has the increment utf8 behaviour you want. It's implemented on a scalar https://github.com/vortex-data/vortex/blob/develop/vortex-scalar/src/binary.rs#L101 so it can mutate it in place. I wonder if they just have to stay separate but maybe upper bound being in place for binary scalar is just a cute optimistaion |
Benchmarks: TPC-H SF=1 on NVMESummary
Detailed Results Table
|
Benchmarks: FineWeb NVMeSummary
Detailed Results Table
|
Benchmarks: TPC-H SF=1 on S3Summary
Detailed Results Table
|
Benchmarks: TPC-DS SF=1 on NVMESummary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on NVMESummary
Detailed Results Table
|
Benchmarks: FineWeb S3Summary
Detailed Results Table
|
Benchmarks: Statistical and Population GeneticsSummary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on S3Summary
Detailed Results Table
|
Benchmarks: Clickbench on NVMESummary
Detailed Results Table
|
|
That function is missing some of the UTF-8 special character handling in the DF implementation, so now I'm worried that one of us is doing the wrong thing... |
|
I gave you the binary version of this function, the utf8 version is here https://github.com/vortex-data/vortex/blob/develop/vortex-scalar/src/utf8.rs#L98 apologies |
|
I was looking at the utf8 version, I guess I don't know enough about utf8 to know if this covers the same edge cases that the datafusion implementation does |
|
I derived it from the same code so we should make sure it’s the same so stats behave the same as regular filters |
3927c66 to
c24c047
Compare
| // Attempt to do min/max pruning for LIKE 'exact' or LIKE 'prefix%' | ||
|
|
||
| // Don't attempt to handle ilike or negated like | ||
| if like_opts.negated || like_opts.case_insensitive { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory we could handle negated like if we know that the block ONLY contains things which match the prefix, but we'd need to add a starts_with expression and I didn't want to do that here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, we can do this without a starts_with, if we see that min >= prefix AND min < succ AND max >= prefix AND max < succ AND null_count = 0
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
Signed-off-by: Andrew Duffy <[email protected]>
2e47faf to
e35f51d
Compare
Fixes #6128
There are two types of
LIKEfilters that we can prune using our zone maps:col LIKE 'exact'col.min > 'exact' OR col.max < 'exact'col LIKE 'prefix%'col.min >= 'prefiy' OR col.max < 'prefix'I extracted out the existing logic from Utf8Scalar::upper_bound into a trait so we can make sure logic is consistent across stats gen/pruning