Implement disk spilling for all grouping ordering modes in GroupedHashAggregateStream #19287
Conversation
Force-pushed from b8de4b9 to 8cdddb0.
Force-pushed from 8cdddb0 to deb73a9.
Extended test failures are related to the changes in this PR. Will review as soon as I can.
Force-pushed from 024f425 to d9af260.
I've extended this PR to include spill support for grouping by partially or fully sorted inputs. This fixed the previously failing tests.
    let mut new_requirements = new_sort_exprs
        .into_iter()
        .map(PhysicalSortRequirement::from)
Rather than creating new sort requirements with default options, this copies the sort options from the input instead.
    let group_ordering = GroupOrdering::try_new(&agg.input_order_mode)?;
    let oom_mode = match group_ordering {
        GroupOrdering::None => {
            if agg.mode == AggregateMode::Partial {
The intention here is to keep the existing emit-early behaviour for AggregateMode::Partial.
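To make the dispatch concrete, here is a hedged sketch of how the match above might select an out-of-memory mode. The OutOfMemoryMode::EmitEarly and OutOfMemoryMode::Spill variants are named elsewhere in this review, but the exact arms below are my assumption, not the PR's literal code:

    // Sketch only: the arm bodies are illustrative and may differ from the PR.
    let oom_mode = match group_ordering {
        GroupOrdering::None => {
            if agg.mode == AggregateMode::Partial {
                // Preserve the pre-existing behaviour: partial aggregation
                // emits accumulated groups early instead of spilling to disk.
                OutOfMemoryMode::EmitEarly
            } else {
                OutOfMemoryMode::Spill
            }
        }
        // Partially or fully sorted group orderings can spill, because the
        // spilled runs can later be merged back in group order.
        GroupOrdering::Partial(_) | GroupOrdering::Full(_) => OutOfMemoryMode::Spill,
    };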
    fn find_sort_options(
        output_ordering: Option<&LexOrdering>,
        expr: &dyn PhysicalExpr,
    ) -> SortOptions {
        if let Some(ordering) = output_ordering {
            for e in ordering {
                if e.expr.as_ref().dyn_eq(expr) {
                    return e.options;
                }
            }
        }

        SortOptions::default()
    }
I wasn't 100% sure about this implementation. Perhaps there's a better (more elegant) way to figure out what the correct SortOptions are.
Maybe we could add a method to LexOrdering -- so this could be like
LexOrdering::find_options(expr).unwrap_or_default()
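For illustration, a sketch of what such a helper could look like if it were added inside the crate that defines LexOrdering. The method name follows the suggestion above; the exact signature is an assumption and this is not an existing DataFusion API:

    impl LexOrdering {
        /// Returns the SortOptions of the first sort expression in this
        /// ordering whose expression matches `expr`, if any.
        /// (Hypothetical addition, mirroring the dyn_eq comparison used in
        /// the find_sort_options snippet quoted above.)
        pub fn find_options(&self, expr: &dyn PhysicalExpr) -> Option<SortOptions> {
            self.iter()
                .find(|sort_expr| sort_expr.expr.as_ref().dyn_eq(expr))
                .map(|sort_expr| sort_expr.options)
        }
    }

The call site would then become output_ordering.and_then(|o| o.find_options(expr)).unwrap_or_default().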
alamb left a comment:
Thank you @pepijnve -- this looks very close to me. I reviewed the code carefully and the only thing I have a concern about is spilling the extra batch. Otherwise this looks good to me.
I will kick off some benchmarks just to be sure.
cc @rluvaton and @2010YOUY01 who I think have also been working in this code recently
    }

    #[tokio::test]
    async fn test_grouped_aggregation_respects_memory_limit() -> Result<()> {
I verified that these tests cover the new code, as they fail without the code in this PR:
failures:
---- aggregates::tests::test_grouped_aggregation_respects_memory_limit stdout ----
thread 'aggregates::tests::test_grouped_aggregation_respects_memory_limit' (25899629) panicked at datafusion/physical-plan/src/aggregates/mod.rs:3384:17:
Expected spill but SpillCount metric not found or SpillCount was 0.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- aggregates::tests::test_order_is_retained_when_spilling stdout ----
thread 'aggregates::tests::test_order_is_retained_when_spilling' (25899632) panicked at datafusion/physical-plan/src/aggregates/mod.rs:3384:17:
Expected spill but SpillCount metric not found or SpillCount was 0.
failures:
aggregates::tests::test_grouped_aggregation_respects_memory_limit
aggregates::tests::test_order_is_retained_when_spilling
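For readers who want to reproduce this kind of check, here is a hedged sketch of how a memory limit can be imposed so that a grouped aggregation is forced to spill. This is not the test from this PR; the table name, CSV path, and query are hypothetical, and the API names (RuntimeEnvBuilder, SessionContext::new_with_config_rt, MetricsSet::spill_count) reflect my understanding of recent DataFusion versions:

    use std::sync::Arc;
    use datafusion::error::Result;
    use datafusion::execution::runtime_env::RuntimeEnvBuilder;
    use datafusion::physical_plan::{collect, ExecutionPlan};
    use datafusion::prelude::*;

    /// Sums the SpillCount metric over an execution plan tree.
    fn total_spill_count(plan: &Arc<dyn ExecutionPlan>) -> usize {
        let own = plan.metrics().and_then(|m| m.spill_count()).unwrap_or(0);
        own + plan.children().into_iter().map(total_spill_count).sum::<usize>()
    }

    #[tokio::test]
    async fn grouped_aggregation_spills_under_memory_limit() -> Result<()> {
        // Constrain the memory pool so the group-by state cannot fit in memory.
        let runtime = RuntimeEnvBuilder::new()
            .with_memory_limit(32 * 1024 * 1024, 1.0) // 32 MiB
            .build_arc()?;
        let ctx = SessionContext::new_with_config_rt(SessionConfig::new(), runtime);
        // Hypothetical input, assumed large enough to exceed the limit.
        ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;

        let df = ctx.sql("SELECT a, count(*) FROM t GROUP BY a").await?;
        let plan = df.create_physical_plan().await?;
        let _batches = collect(Arc::clone(&plan), ctx.task_ctx()).await?;

        // At least one operator (the aggregation) should report a spill.
        assert!(total_spill_count(&plan) > 0, "expected the aggregation to spill");
        Ok(())
    }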
      // aggregation operator, we must use a schema that includes both the group
      // columns **and** the partial-state columns.
    - let partial_agg_schema = create_schema(
    + let spill_schema = Arc::new(create_schema(
Given that this is used for spilling (per the comments), renaming these variables makes sense to me.
    let output_expr = Column::new(field.name().as_str(), idx);

    // Try to use the sort options from the output ordering, if available.
    // This ensures that spilled state is emitted in the expected order.
👍
    }

    /// Determines if `spill_state_if_oom` can free up memory by spilling state to disk
    fn can_spill_on_oom(&self) -> bool {
nit: it would be nice to move this closer to can_emit_early_on_oom so the two functions sit together in the code.
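For context, a hedged guess at what the two predicates being discussed might look like side by side. The oom_mode field and OutOfMemoryMode variants are named elsewhere in this review, but the bodies below are assumptions:

    /// Can memory be freed by spilling accumulated state to disk? (sketch)
    fn can_spill_on_oom(&self) -> bool {
        matches!(self.oom_mode, OutOfMemoryMode::Spill)
    }

    /// Can memory be freed by emitting group values early? (sketch)
    fn can_emit_early_on_oom(&self) -> bool {
        matches!(self.oom_mode, OutOfMemoryMode::EmitEarly)
    }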
    // Spill the last remaining rows (if any) to free up as much memory as possible.
    // Since we're already spilling, we can be sure we're memory constrained.
    // Creating an extra spill file won't make much of a difference.
    self.spill()?;
It sort of depends on how large the spill file was -- like if we have 2GB of data in memory, writing it to a new spill file will likely make a measurable negative difference in performance.
I had my doubts about this one myself due to performance regression concerns. On the other hand, the previous code was kind of cheating in the sense that the last batch is held in memory and, unless I'm mistaken, is no longer accounted for in the memory reservation.
BTW, this is what SortExec does: it always spills what it has in memory when it's about to do an external sort.
Agreed -- this approach makes sense for now. We can try and optimize it in the future if needed
    // clear up memory for streaming_merge
    self.clear_all();
    self.update_memory_reservation()?;
    let mut streams: Vec<SendableRecordBatchStream> = vec![];
This change is because the code now forces a spill even if it had memory to hold the last batch in memory, right?
I feel like, while the code is now cleaner, we have lost some performance.
Maybe we could factor the creation of the stream into a method so it looks something like this
    if let Some(stream) = self.emit_to_stream(EmitTo::All, true) {
        builder = builder.with_streams(vec![stream]);
    }
Yes, that's correct. By spilling the last batch the special handling is no longer required. Clean code wasn't really the main driver though; accounting for the batch in the memory reservation was.
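Putting the thread together, a hedged sketch of the finalization flow being discussed. The methods called on `self` are the ones quoted in this review, while `read_spill_as_stream`, the `spill_state.spills`/`spill_schema` fields, and the overall structure are hypothetical stand-ins rather than the PR's actual code:

    fn set_input_done_and_produce_output(&mut self) -> Result<()> {
        // Spill whatever is still held in the aggregation state so that all
        // remaining data lives in spill files and the memory reservation can
        // be released before the merge starts.
        self.spill()?;

        // clear up memory for streaming_merge
        self.clear_all();
        self.update_memory_reservation()?;

        // Open one sorted stream per spill file; a streaming merge over these
        // yields output fully sorted on the group columns, so the remaining
        // aggregation can emit with EmitTo::First.
        let mut streams: Vec<SendableRecordBatchStream> = vec![];
        for spill_file in self.spill_state.spills.drain(..) {
            // `read_spill_as_stream` is a hypothetical helper name here.
            streams.push(read_spill_as_stream(spill_file, self.spill_state.spill_schema.clone())?);
        }
        // ...hand `streams` to the streaming merge and switch the operator
        // into its "reading merged spill data" state...
        Ok(())
    }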
run benchmarks
run benchmark external_aggr
🤖: Benchmark completed
Benchmark runner seems to be struggling.
It seems that the benchmark doesn't complete on main (I haven't tried on this branch).
Interesting. I'll have a look at that tomorrow.
Force-pushed from 4fdb8f0 to f99ba14.
@alamb 9f2f22e makes the […] Because I was setting […] Additionally, the 'emit early' code would only kick in if there was at least one full batch of group values to emit. I've modified the code to still try to do that if it can, but fall back to emitting a smaller batch if memory pressure requires it.
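To illustrate the fallback described in the comment above, a hedged sketch; EmitTo::First is an existing DataFusion identifier, while the `group_values`, `batch_size`, and `emit` members are assumed names for this sketch:

    // Sketch only: prefer emitting a full batch of leading group values, but
    // under memory pressure emit whatever is available instead of giving up.
    let group_count = self.group_values.len();      // assumed field
    if group_count == 0 {
        return Ok(());                              // nothing to emit
    }
    let emit_to = if group_count >= self.batch_size {
        // Enough groups accumulated: emit a full batch, as before.
        EmitTo::First(self.batch_size)
    } else {
        // Fewer groups than a full batch: emit them anyway to free memory.
        EmitTo::First(group_count)
    };
    let batch = self.emit(emit_to, false)?;         // assumed helper signature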
One remaining question I have is how […]. In […]. The same goes for […].
…eam when spilling is not possible
- Report 'can spill: true' for emit early mode
- Emit fewer than 'batch size' group values if necessary
…ut-of-memory error
Force-pushed from 0a3b96f to 1cc05a7.
kazuyukitanimura left a comment:
Thank you @pepijnve
    /// At this point, all the inputs are read and there are some spills.
    /// Emit the remaining rows and create a batch.
    /// Conduct a streaming merge sort between the batch and spilled data. Since the stream is fully
    /// sorted, set `self.group_ordering` to Full, then later we can read with [`EmitTo::First`].
Should we update this comment because we now always spill the remaining rows?
Also, is there any way to measure the impact of this change in a micro benchmark?
Good catch. Yes, I'll revise that comment a bit.
I've made some adjustments that allow benchmarks/src/bin/external_aggr.rs to run to completion again (it was broken on main). I haven't inspected the query plans it uses yet to see which code path it's actually covering.
Because it wasn't working on main, it's not possible to use it for comparative testing.
I decided to inline update_merged_stream into set_input_done_and_produce_output. I couldn't come up with a sensible name for it, and found it to be clearer if the two alternatives in set_input_done_and_produce_output are close together.
I've added additional comments in set_input_done_and_produce_output to explain what's being done there and why.
In external_aggr.rs, the query plan for Q1 is
    ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
      AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
        CoalescePartitionsExec
          AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
            ProjectionExec: expr=[]
              AggregateExec: mode=FinalPartitioned, gby=[l_orderkey@0 as l_orderkey, l_suppkey@1 as l_suppkey], aggr=[]
                RepartitionExec: partitioning=Hash([l_orderkey@0, l_suppkey@1], 4), input_partitions=4
                  AggregateExec: mode=Partial, gby=[l_orderkey@0 as l_orderkey, l_suppkey@1 as l_suppkey], aggr=[]
                    DataSourceExec: ...
I've confirmed with the debugger that a mix of OutOfMemoryMode::EmitEarly and OutOfMemoryMode::Spill is used. So that benchmark is definitely useful to measure the performance impact of changes in this area of the code.
@alamb Is it worth trying to get it in working state again on main with only minimal changes first? My appetite for that work is rather limited unless it's considered a blocker for this PR.
@alamb Is it worth trying to get it in working state again on main with only minimal changes first? My appetite for that work is rather limited unless it's considered a blocker for this PR.
I don't think so. If we hit issues we can always make a new PR
I think this one is ready to go. I'll just update it once more and run benchmarks one more time to make me feel better, and plan to merge it in.
run benchmark clickbench_partitioned
run benchmark tpch
🤖: Benchmark completed
🤖: Benchmark completed
Thanks again @pepijnve and @kazuyukitanimura
Which issue does this PR close?
Rationale for this change
GroupedHashAggregateStream currently always reports that it can spill to the memory tracking subsystem even though this is dependent on the aggregation mode and the grouping order.
The optimistic logic in group_aggregate_batch does not correctly take the spilling preconditions into account, which can lead to excessive memory use. In order to resolve this, this PR implements disk spilling for all grouping modes.
What changes are included in this PR?
- MemoryConsumer::can_spill changed to reflect actual spilling behaviour
- group_aggregate_batch is aligned with the disk spilling or early emission logic

Are these changes tested?
Added an additional test case to demonstrate the problem.
Added a test case to check that output order is respected after spilling.
Are there any user-facing changes?
Yes, memory exhaustion may be reported much earlier in the query pipeline than is currently the case. In my local tests with a per-consumer memory limit of 32 MiB, grouped aggregation would consume 480 MiB in practice. This was then reported by ExternalSortExec, which choked on trying to reserve that much memory.