perf: Optimize startsWith and endsWith string functions #3000

Mazen-Ghanaym · 2025-12-27T17:26:46Z

Which issue does this PR close?

Closes #2973.

Rationale for this change

The startsWith and endsWith string functions were previously delegated to DataFusion's built-in scalar functions, which introduced unnecessary overhead and did not fully leverage Comet's native execution capabilities. This PR implements optimized native expressions to improve performance.

What changes are included in this PR?

This PR introduces custom StartsWithExpr and EndsWithExpr physical expressions with the following optimizations:

startsWith:

- Uses Arrow's compute::starts_with kernel with a pre-allocated pattern array to avoid per-batch allocations.
- Achieves 1.1X speedup over Spark.

endsWith:

- Uses direct buffer access to the underlying StringArray data, bypassing iterator overhead.
- Manually calculates suffix offsets and performs raw byte slice comparison (memcmp).
- Achieves 1.0X parity with Spark (improved from 0.9X regression).
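
The endsWith approach described above can be sketched without Arrow at all. The following is a minimal, std-only illustration, not the code from this PR: `StringColumn` and `ends_with_raw` are hypothetical names, the column is modeled as an offsets slice plus one contiguous byte buffer (the layout Arrow string arrays use), and null handling is left out.

```rust
/// Hypothetical stand-in for an Arrow-style string column: `offsets` has
/// `rows + 1` entries, and row `i` occupies `values[offsets[i]..offsets[i + 1]]`.
struct StringColumn<'a> {
    offsets: &'a [i32],
    values: &'a [u8],
}

/// Returns one bool per row: true when the row's bytes end with `pattern`.
/// Null handling is omitted to keep the sketch short.
fn ends_with_raw(col: &StringColumn, pattern: &[u8]) -> Vec<bool> {
    let p_len = pattern.len();
    let rows = col.offsets.len().saturating_sub(1);
    let mut out = Vec::with_capacity(rows);
    for i in 0..rows {
        let start = col.offsets[i] as usize;
        let end = col.offsets[i + 1] as usize;
        // A row shorter than the pattern can never match; otherwise compare
        // only the trailing `p_len` bytes of the row against the pattern.
        let matched = end - start >= p_len && &col.values[end - p_len..end] == pattern;
        out.push(matched);
    }
    out
}

fn main() {
    // Three rows stored back to back: "foo.rs", "bar", "rs".
    let values: &[u8] = b"foo.rsbarrs";
    let offsets = [0i32, 6, 9, 11];
    let col = StringColumn { offsets: &offsets, values };
    println!("{:?}", ends_with_raw(&col, b"rs")); // [true, false, true]
}
```

The suffix start is derived from the row's end offset, so no per-row string slice or iterator item is materialized before the byte comparison.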

Files Changed:

native/spark-expr/src/string_funcs/starts_ends_with.rs (NEW)
native/spark-expr/src/string_funcs/mod.rs
native/core/src/execution/expressions/strings.rs
native/core/src/execution/planner/expression_registry.rs
spark/src/main/scala/org/apache/comet/serde/strings.scala
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
spark/src/test/scala/org/apache/spark/sql/benchmark/CometStringExpressionBenchmark.scala

How are these changes tested?

Existing Tests: The implementation passes all existing Comet tests, including TPC-DS and TPC-H correctness suites which exercise string functions.
Benchmark Verification: Performance was verified using CometStringExpressionBenchmark:
- startsWith: 1.1X faster than Spark (Comet 1887ms vs Spark 2028ms)
- endsWith: 1.0X parity with Spark (Comet 3389ms vs Spark 3354ms)
CI Verification: A temporary workflow was used to verify that the benchmark executes correctly in the GitHub Actions CI environment. Its results are listed under Benchmark Results.

Benchmark Results

Environment: OpenJDK 64-Bit Server VM 11.0.29+7-LTS on Linux 6.11.0-1018-azure
Processor: AMD EPYC 7763 64-Core Processor

startsWith

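| Case | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
| --- | --- | --- | --- | --- | --- | --- |
| Spark | 1657 | 1669 | 17 | 0.6 | 1580.2 | 1.0X |
| Comet (Scan) | 1740 | 1755 | 20 | 0.6 | 1659.6 | 1.0X |
| Comet (Scan + Exec) | 1546 | 1546 | 1 | 0.7 | 1474.1 | 1.1X |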

endsWith

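| Case | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
| --- | --- | --- | --- | --- | --- | --- |
| Spark | 1625 | 1632 | 10 | 0.6 | 1549.9 | 1.0X |
| Comet (Scan) | 1731 | 1732 | 1 | 0.6 | 1651.2 | 0.9X |
| Comet (Scan + Exec) | 1562 | 1563 | 0 | 0.7 | 1490.0 | 1.0X |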

Summary:

startsWith: 1.1X faster than Spark (1546ms vs 1657ms)
endsWith: 1.0X parity with Spark (1562ms vs 1625ms)
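
The Relative column appears to be the Spark best time divided by each case's best time, shown to one decimal place. That reading is an assumption, but it reproduces the figures in the tables. A small sketch of the arithmetic, with `relative` as a hypothetical helper name:

```rust
/// Formats the speedup of a case relative to a baseline, from best times in ms.
/// Assumes the benchmark reports baseline_best / case_best to one decimal place.
fn relative(baseline_best_ms: f64, case_best_ms: f64) -> String {
    format!("{:.1}X", baseline_best_ms / case_best_ms)
}

fn main() {
    // Best times (ms) copied from the Benchmark Results tables.
    println!("startsWith: {}", relative(1657.0, 1546.0)); // 1.1X
    println!("endsWith:   {}", relative(1625.0, 1562.0)); // 1.0X
}
```

On these figures the unrounded ratios are roughly 1.07 for startsWith and 1.04 for endsWith, so the one-decimal display hides a small endsWith gain over Spark.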

- Implements a hybrid optimization strategy for string primitives.
- Uses Arrow compute kernels for startsWith with pre-allocated pattern arrays to avoid per-batch allocation overhead.
- Uses direct buffer access and manual suffix calculation for endsWith to bypass iterator overhead and match JVM intrinsic performance.
- Achieves 1.1X speedup for startsWith and parity (1.0X) for endsWith compared to Spark.

Closes apache#2973

coderfender · 2025-12-27T17:32:46Z

Thank you for the PR, @Mazen-Ghanaym. Any reason why we can't make DataFusion's version faster here?

Mazen-Ghanaym · 2025-12-27T17:56:02Z

> Thank you for the PR, @Mazen-Ghanaym. Any reason why we can't make DataFusion's version faster here?

I spent around four days trying to optimize this, and here is the short story of the journey.

Day 1-2: Tried DataFusion built-ins. Started with ScalarFunctionExpr using DataFusion's native starts_with/ends_with. Result: 1.0X, which only matched Spark with no improvement.

Day 2: Tried unsafe Rust with direct byte access. Attempted raw pointer manipulation for maximum speed. Failed due to compatibility issues.

Day 3: Tried safe Rust with stdlib slices. Used .starts_with() and .ends_with() on string slices with manual iteration. Result: 0.9X, which is actually slower than Spark. I think it's due to iterator overhead.

Day 3-4: Arrow compute kernels + pre-allocated pattern. The breakthrough: I realized the overhead came from repeatedly processing the pattern each batch. Pre-allocating the pattern as a StringArray once and calling Arrow's compute::starts_with directly gave us 1.1X for startsWith.
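
The build-once, reuse-per-batch idea can be illustrated with a small std-only sketch. This is not the PR's implementation: `StartsWithSketch` is a hypothetical name, plain byte-slice comparison stands in for Arrow's compute::starts_with kernel, and a batch is modeled as a slice of optional strings where `None` is a null row.

```rust
/// Hypothetical expression object: the pattern is converted to bytes once,
/// at construction, and reused for every batch.
struct StartsWithSketch {
    pattern: Vec<u8>,
}

impl StartsWithSketch {
    fn new(pattern: &str) -> Self {
        Self { pattern: pattern.as_bytes().to_vec() }
    }

    /// Evaluates one batch; `None` models a null row and stays null.
    fn evaluate(&self, batch: &[Option<&str>]) -> Vec<Option<bool>> {
        batch
            .iter()
            .map(|&row| row.map(|s| s.as_bytes().starts_with(&self.pattern)))
            .collect()
    }
}

fn main() {
    let expr = StartsWithSketch::new("ab");
    // The same `expr` serves both batches; the pattern is not rebuilt in between.
    println!("{:?}", expr.evaluate(&[Some("abc"), None, Some("xab")]));
    println!("{:?}", expr.evaluate(&[Some("ab"), Some("a")]));
}
```

The sketch only shows where the pattern lives. The per-row comparison here is the stdlib approach from Day 3, not the Arrow kernel that produced the 1.1X result.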

For endsWith, Arrow's kernel was still slightly slower (0.9X), so I went with direct buffer access and manual suffix calculation to reach 1.0X parity.

I tried to optimize further but I don't know any optimizations that can beat Java in these direct, simple operations.

andygrove · 2026-01-02T20:59:50Z

I will start reviewing this PR today.

There was a PR merged into DataFusion last week to speed up starts_with / ends_with there, so we may want to switch to that in the future.

apache/datafusion#19516

andygrove · 2026-01-02T21:05:24Z

native/spark-expr/src/string_funcs/starts_ends_with.rs

+                let offsets = string_array.value_offsets();
+                let values = string_array.value_data();
+                let pattern_bytes = self.pattern.as_bytes();
+                let p_len = self.pattern_len;


Why can't we just use Arrow's ends_with kernel, similar to how starts_with is handled?

Mazen-Ghanaym · 2026-01-02

I actually tried that first! Used arrow::compute::ends_with with the same Scalar approach as starts_with, but it was still ~0.9X vs Spark at the time. The direct buffer access is what got it to 1.0X parity.

I saw the DataFusion PR you linked (apache/datafusion#19516); it looks like it adds the same Scalar optimization. Once Comet picks up that version, happy to switch this to the standard kernel and remove the custom code!

andygrove · 2026-01-02T21:05:48Z

native/spark-expr/src/string_funcs/starts_ends_with.rs

+                // Use Arrow's highly optimized SIMD kernel
+                let result = compute::starts_with(&array, &scalar)?;


codecov-commenter · 2026-01-02T21:22:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.62%. Comparing base (f09f8af) to head (d2f0aea).
⚠️ Report is 811 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3000      +/-   ##
============================================
+ Coverage     56.12%   59.62%   +3.50%     
- Complexity      976     1377     +401     
============================================
  Files           119      167      +48     
  Lines         11743    15509    +3766     
  Branches       2251     2569     +318     
============================================
+ Hits           6591     9248    +2657     
- Misses         4012     4962     +950     
- Partials       1140     1299     +159


Mazen-Ghanaym and others added 6 commits December 27, 2025 03:32

Merge branch 'apache:main' into feat/optimize-strings-2973

aabf3f3

Fix code formatting for CI check

9248b70

Fix clippy lints

122623e

Add temp benchmark workflow for CI verification

fe9cabb

Remove temp benchmark workflow

d2f0aea

Mazen-Ghanaym changed the title ~~Feat/optimize strings 2973~~ feat: Optimize startsWith and endsWith string functions Dec 27, 2025

Mazen-Ghanaym changed the title ~~feat: Optimize startsWith and endsWith string functions~~ perf: Optimize startsWith and endsWith string functions Dec 27, 2025

andygrove reviewed Jan 2, 2026

4 participants
