
Conversation

@pepijnve (Contributor) commented Dec 11, 2025

Which issue does this PR close?

Rationale for this change

GroupedHashAggregateStream currently always reports to the memory tracking subsystem that it can spill, even though this depends on the aggregation mode and the grouping order.
The optimistic logic in group_aggregate_batch does not correctly take the spilling preconditions into account, which can lead to excessive memory use.

To resolve this, this PR implements disk spilling for all grouping modes.

What changes are included in this PR?

  • Correctly set MemoryConsumer::can_spill to reflect actual spilling behaviour (see the sketch after this list)
  • Ensure the optimistic out-of-memory tolerance in group_aggregate_batch is aligned with the disk spilling or early emission logic
  • Implement output-order-respecting disk spilling for partially and fully sorted inputs
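
As a sketch of the first item (hedged; this uses the datafusion_execution MemoryConsumer API, but the function and its condition here are illustrative stand-ins, not the PR's exact code):

use std::sync::Arc;
use datafusion_execution::memory_pool::{MemoryConsumer, MemoryPool, MemoryReservation};

// Illustrative only: report `can_spill` to the pool based on whether this
// stream can actually free memory by spilling, instead of always `true`.
fn register_reservation(
    pool: &Arc<dyn MemoryPool>,
    partition: usize,
    can_actually_spill: bool,
) -> MemoryReservation {
    MemoryConsumer::new(format!("GroupedHashAggregateStream[{partition}]"))
        .with_can_spill(can_actually_spill)
        .register(pool)
}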

Are these changes tested?

Added a test case to demonstrate the problem.
Added a test case to check that output order is respected after spilling.

Are there any user-facing changes?

Yes, memory exhaustion may be reported much earlier in the query pipeline than is currently the case. In my local tests with a per-consumer memory limit of 32 MiB, grouped aggregation would consume 480 MiB in practice. This was then reported by ExternalSortExec, which choked on trying to reserve that much memory.
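
For context, a hedged sketch of how such a per-query memory cap can be configured (32 MiB as in the local test above; FairSpillPool is the pool type discussed later in this thread, and the exact builder calls may differ between DataFusion versions):

use std::sync::Arc;
use datafusion::execution::context::SessionContext;
use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::RuntimeEnvBuilder;
use datafusion::prelude::SessionConfig;

fn main() -> datafusion::error::Result<()> {
    // Cap the pool at 32 MiB so grouped aggregation must spill (or emit
    // early) instead of growing unbounded.
    let runtime = RuntimeEnvBuilder::new()
        .with_memory_pool(Arc::new(FairSpillPool::new(32 * 1024 * 1024)))
        .build_arc()?;
    let _ctx = SessionContext::new_with_config_rt(SessionConfig::new(), runtime);
    Ok(())
}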

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 11, 2025
@pepijnve (Contributor Author) commented Dec 11, 2025

The wasm test failure seems unrelated; it appears to have been a fluke.

@pepijnve (Contributor Author)

The extended test failures are related to the changes in this PR. I'll review them as soon as I can.

@pepijnve pepijnve marked this pull request as draft December 12, 2025 16:27
@pepijnve pepijnve changed the title Respect memory pool size when in GroupedHashAggregateStream when spilling is not possible Implement disk spilling for all grouping modes in GroupedHashAggregateStream Dec 12, 2025
@pepijnve pepijnve marked this pull request as ready for review December 12, 2025 17:02
@pepijnve pepijnve changed the title Implement disk spilling for all grouping modes in GroupedHashAggregateStream Implement disk spilling for all grouping ordering modes in GroupedHashAggregateStream Dec 12, 2025
@pepijnve (Contributor Author)

I've extended this PR to include spill support for grouping on partially or fully sorted inputs. This fixed the previously failing tests.


let mut new_requirements = new_sort_exprs
    .into_iter()
    .map(PhysicalSortRequirement::from)
@pepijnve (Contributor Author):

Rather than creating new sort requirements with default options, this copies the sort options from the input instead.

let group_ordering = GroupOrdering::try_new(&agg.input_order_mode)?;
let oom_mode = match group_ordering {
    GroupOrdering::None => {
        if agg.mode == AggregateMode::Partial {
@pepijnve (Contributor Author):

The intention here is to keep the earlier emit-early behaviour for AggregateMode::Partial.
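
Spelled out as a hedged sketch (the OutOfMemoryMode variants appear later in this thread; the exact arm bodies are inferred from the discussion, not copied from the PR):

// Illustrative: choose the out-of-memory strategy from the grouping order
// and the aggregation mode. A Partial aggregation with unordered input can
// emit early and let the Final stage re-aggregate; everything else spills.
let oom_mode = match group_ordering {
    GroupOrdering::None => {
        if agg.mode == AggregateMode::Partial {
            OutOfMemoryMode::EmitEarly
        } else {
            OutOfMemoryMode::Spill
        }
    }
    GroupOrdering::Partial(_) | GroupOrdering::Full(_) => OutOfMemoryMode::Spill,
};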

Comment on lines 677 to 695
fn find_sort_options(
    output_ordering: Option<&LexOrdering>,
    expr: &dyn PhysicalExpr,
) -> SortOptions {
    if let Some(ordering) = output_ordering {
        for e in ordering {
            if e.expr.as_ref().dyn_eq(expr) {
                return e.options;
            }
        }
    }

    SortOptions::default()
}

@pepijnve (Contributor Author):

I wasn't 100% sure about this implementation. Perhaps there's a better (more elegant) way to figure out what the correct SortOptions are.

Contributor:

Maybe we could add a method to LexOrdering -- so this could be like

LexOrdering::find_options(expr).unwrap_or_default()
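
Spelling that suggestion out as a free function (hypothetical; no such method exists on LexOrdering at the time of this review, and the iteration simply mirrors find_sort_options above):

// Hypothetical equivalent of the suggested LexOrdering::find_options:
// return the sort options for `expr` if it participates in `ordering`.
// Uses the same LexOrdering/PhysicalExpr/dyn_eq items as find_sort_options.
fn find_options(ordering: &LexOrdering, expr: &dyn PhysicalExpr) -> Option<SortOptions> {
    ordering
        .iter()
        .find(|e| e.expr.as_ref().dyn_eq(expr))
        .map(|e| e.options)
}

// The call site would then read:
// find_options(&ordering, expr).unwrap_or_default()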

@alamb (Contributor) left a comment

Thank you @pepijnve -- this looks very close to me. I reviewed the code carefully, and the only thing I have a concern about is spilling the extra batch. Otherwise this looks good to me.

I will kick off some benchmarks just to be sure

cc @rluvaton and @2010YOUY01 who I think have also been working in this code recently

}

#[tokio::test]
async fn test_grouped_aggregation_respects_memory_limit() -> Result<()> {
Contributor:

I verified these tests cover the new code, as they fail without the code in this PR:

failures:

---- aggregates::tests::test_grouped_aggregation_respects_memory_limit stdout ----

thread 'aggregates::tests::test_grouped_aggregation_respects_memory_limit' (25899629) panicked at datafusion/physical-plan/src/aggregates/mod.rs:3384:17:
Expected spill but SpillCount metric not found or SpillCount was 0.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- aggregates::tests::test_order_is_retained_when_spilling stdout ----

thread 'aggregates::tests::test_order_is_retained_when_spilling' (25899632) panicked at datafusion/physical-plan/src/aggregates/mod.rs:3384:17:
Expected spill but SpillCount metric not found or SpillCount was 0.



failures:
    aggregates::tests::test_grouped_aggregation_respects_memory_limit
    aggregates::tests::test_order_is_retained_when_spilling


  // aggregation operator, we must use a schema that includes both the group
  // columns **and** the partial-state columns.
- let partial_agg_schema = create_schema(
+ let spill_schema = Arc::new(create_schema(
Contributor:

Given that this is used for spilling (per the comments), renaming these variables makes sense to me.

let output_expr = Column::new(field.name().as_str(), idx);

// Try to use the sort options from the output ordering, if available.
// This ensures that spilled state is emitted in the expected order.
Contributor:

👍


}

/// Determines if `spill_state_if_oom` can free up memory by spilling state to disk
fn can_spill_on_oom(&self) -> bool {
Contributor:

Nit: it would be nice to move this closer to can_emit_early_on_oom so the two are together in the code.

Comment on lines 1201 to 1178
// Spill the last remaining rows (if any) to free up as much memory as possible.
// Since we're already spilling, we can be sure we're memory constrained.
// Creating an extra spill file won't make much of a difference.
self.spill()?;
Contributor:

It sort of depends on how large the spill file is -- if we have 2GB of data in memory, writing it to a new spill file will likely make a measurably negative difference in performance.

@pepijnve (Contributor Author):

I had my doubts about this one myself due to performance regression concerns. On the other hand, the previous code was kind of cheating in the sense that the last batch was held in memory and, unless I'm mistaken, no longer accounted for in the memory reservation.

Member:

BTW, this is what the sort exec does - it always spills whatever is in memory when it's about to do an external sort.

Contributor:

Agreed -- this approach makes sense for now. We can try and optimize it in the future if needed

// clear up memory for streaming_merge
self.clear_all();
self.update_memory_reservation()?;
let mut streams: Vec<SendableRecordBatchStream> = vec![];
Contributor:

This change is because the code now forces a spill even if it had memory to hold the last batch in memory, right?

I feel like, while the code is now cleaner, we have lost some performance.

Maybe we could factor the creation of the stream into a method so it looks something like this:

if let Some(stream) = self.emit_to_stream(EmitTo::All, true) {
    builder = builder.with_streams(vec![stream]);
}

@pepijnve (Contributor Author):

Yes, that's correct. By spilling the last batch, the special handling is no longer required. Clean code wasn't really the main driver though; accounting for the batch in the memory reservation was.

@alamb (Contributor) commented Dec 15, 2025

run benchmarks

@alamb (Contributor) commented Dec 15, 2025

run benchmark external_aggr

@alamb-ghbot

🤖 ./gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_19286 (d9af260) to e914935 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


Comparing HEAD and issue_19286
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_19286 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2768.36 ms │  2781.27 ms │ no change │
│ QQuery 1     │  1210.78 ms │  1206.99 ms │ no change │
│ QQuery 2     │  2327.35 ms │  2416.15 ms │ no change │
│ QQuery 3     │  1183.49 ms │  1180.56 ms │ no change │
│ QQuery 4     │  2325.06 ms │  2298.98 ms │ no change │
│ QQuery 5     │ 29417.81 ms │ 28624.36 ms │ no change │
│ QQuery 6     │  4148.20 ms │  4091.45 ms │ no change │
│ QQuery 7     │  3650.20 ms │  3630.43 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 47031.25ms │
│ Total Time (issue_19286)   │ 46230.18ms │
│ Average Time (HEAD)        │  5878.91ms │
│ Average Time (issue_19286) │  5778.77ms │
│ Queries Faster             │          0 │
│ Queries Slower             │          0 │
│ Queries with No Change     │          8 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_19286 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.72 ms │     2.58 ms │ +1.06x faster │
│ QQuery 1     │    50.10 ms │    51.38 ms │     no change │
│ QQuery 2     │   138.01 ms │   136.19 ms │     no change │
│ QQuery 3     │   157.51 ms │   160.57 ms │     no change │
│ QQuery 4     │  1098.11 ms │  1119.95 ms │     no change │
│ QQuery 5     │  1500.24 ms │  1539.26 ms │     no change │
│ QQuery 6     │     2.32 ms │     2.31 ms │     no change │
│ QQuery 7     │    56.13 ms │    55.90 ms │     no change │
│ QQuery 8     │  1427.96 ms │  1490.07 ms │     no change │
│ QQuery 9     │  1875.95 ms │  1981.70 ms │  1.06x slower │
│ QQuery 10    │   378.53 ms │   362.97 ms │     no change │
│ QQuery 11    │   410.78 ms │   411.68 ms │     no change │
│ QQuery 12    │  1344.05 ms │  1409.03 ms │     no change │
│ QQuery 13    │  2038.46 ms │  2032.92 ms │     no change │
│ QQuery 14    │  1263.64 ms │  1304.08 ms │     no change │
│ QQuery 15    │  1257.14 ms │  1228.76 ms │     no change │
│ QQuery 16    │  2647.13 ms │  2698.07 ms │     no change │
│ QQuery 17    │  2641.55 ms │  2717.83 ms │     no change │
│ QQuery 18    │  4998.71 ms │  4972.88 ms │     no change │
│ QQuery 19    │   124.97 ms │   125.49 ms │     no change │
│ QQuery 20    │  1930.71 ms │  1928.07 ms │     no change │
│ QQuery 21    │  2261.38 ms │  2211.14 ms │     no change │
│ QQuery 22    │  3892.42 ms │  3826.63 ms │     no change │
│ QQuery 23    │ 12536.42 ms │ 12495.60 ms │     no change │
│ QQuery 24    │   219.14 ms │   205.82 ms │ +1.06x faster │
│ QQuery 25    │   460.07 ms │   462.27 ms │     no change │
│ QQuery 26    │   225.23 ms │   220.45 ms │     no change │
│ QQuery 27    │  2815.01 ms │  2776.16 ms │     no change │
│ QQuery 28    │ 24048.63 ms │ 23714.54 ms │     no change │
│ QQuery 29    │  1009.00 ms │  1006.84 ms │     no change │
│ QQuery 30    │  1343.16 ms │  1329.69 ms │     no change │
│ QQuery 31    │  1379.58 ms │  1335.89 ms │     no change │
│ QQuery 32    │  4738.56 ms │  4639.03 ms │     no change │
│ QQuery 33    │  5915.69 ms │  6034.39 ms │     no change │
│ QQuery 34    │  6102.79 ms │  6002.09 ms │     no change │
│ QQuery 35    │  2032.87 ms │  1890.81 ms │ +1.08x faster │
│ QQuery 36    │   120.08 ms │   119.18 ms │     no change │
│ QQuery 37    │    56.33 ms │    57.36 ms │     no change │
│ QQuery 38    │   123.82 ms │   119.64 ms │     no change │
│ QQuery 39    │   199.18 ms │   190.88 ms │     no change │
│ QQuery 40    │    47.56 ms │    44.04 ms │ +1.08x faster │
│ QQuery 41    │    42.18 ms │    42.97 ms │     no change │
│ QQuery 42    │    36.59 ms │    37.29 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 94950.38ms │
│ Total Time (issue_19286)   │ 94494.42ms │
│ Average Time (HEAD)        │  2208.15ms │
│ Average Time (issue_19286) │  2197.54ms │
│ Queries Faster             │          4 │
│ Queries Slower             │          1 │
│ Queries with No Change     │         38 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ issue_19286 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 134.20 ms │   136.34 ms │     no change │
│ QQuery 2     │  28.81 ms │    30.21 ms │     no change │
│ QQuery 3     │  40.18 ms │    40.19 ms │     no change │
│ QQuery 4     │  29.77 ms │    30.70 ms │     no change │
│ QQuery 5     │  89.80 ms │    90.01 ms │     no change │
│ QQuery 6     │  20.38 ms │    20.15 ms │     no change │
│ QQuery 7     │ 237.37 ms │   230.91 ms │     no change │
│ QQuery 8     │  36.99 ms │    38.81 ms │     no change │
│ QQuery 9     │ 109.06 ms │   107.79 ms │     no change │
│ QQuery 10    │  65.23 ms │    65.33 ms │     no change │
│ QQuery 11    │  19.01 ms │    18.58 ms │     no change │
│ QQuery 12    │  52.76 ms │    51.70 ms │     no change │
│ QQuery 13    │  48.84 ms │    50.40 ms │     no change │
│ QQuery 14    │  14.48 ms │    15.95 ms │  1.10x slower │
│ QQuery 15    │  24.77 ms │    24.87 ms │     no change │
│ QQuery 16    │  27.06 ms │    26.55 ms │     no change │
│ QQuery 17    │ 164.84 ms │   153.12 ms │ +1.08x faster │
│ QQuery 18    │ 281.56 ms │   285.74 ms │     no change │
│ QQuery 19    │  38.84 ms │    39.09 ms │     no change │
│ QQuery 20    │  50.59 ms │    50.12 ms │     no change │
│ QQuery 21    │ 326.03 ms │   327.57 ms │     no change │
│ QQuery 22    │  18.84 ms │    18.21 ms │     no change │
└──────────────┴───────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 1859.43ms │
│ Total Time (issue_19286)   │ 1852.34ms │
│ Average Time (HEAD)        │   84.52ms │
│ Average Time (issue_19286) │   84.20ms │
│ Queries Faster             │         1 │
│ Queries Slower             │         1 │
│ Queries with No Change     │        20 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

@alamb-ghbot

🤖 ./gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_19286 (d9af260) to e914935 diff using: external_aggr
Results will be posted here when complete

@pepijnve (Contributor Author)

Benchmark runner seems to be struggling

@alamb (Contributor) commented Dec 16, 2025

> Benchmark runner seems to be struggling

It seems that the benchmark doesn't complete on main (I haven't tried on this branch)

     Running `/Users/andrewlamb/Software/datafusion/target/release/external_aggr benchmark --partitions 4 --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/tpch_sf1 -o /Users/andrewlamb/Software/datafusion/benchmarks/results/main/external_aggr.json`
Q1(64.0 MB) iteration 0 took 66.5 ms and returned 1 rows
Q1(64.0 MB) iteration 1 took 40.4 ms and returned 1 rows
Q1(64.0 MB) iteration 2 took 41.3 ms and returned 1 rows
Q1(64.0 MB) iteration 3 took 43.1 ms and returned 1 rows
Q1(64.0 MB) iteration 4 took 40.6 ms and returned 1 rows
Q1(64.0 MB) avg time: 46.38 ms
Error: Shared(ResourcesExhausted("Failed to allocate additional 1850.4 KB for GroupedHashAggregateStream[1] () with 1812.6 KB already allocated for this reservation - 2.7 MB remain available for the total pool"))

@pepijnve (Contributor Author)

Interesting. I’ll have a look at that tomorrow.

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 18, 2025
@pepijnve pepijnve force-pushed the issue_19286 branch 2 times, most recently from 4fdb8f0 to f99ba14 Compare December 18, 2025 09:51
@pepijnve (Contributor Author) commented Dec 18, 2025

@alamb 9f2f22e makes the external_aggr benchmark succeed.

Because I was setting can_spill: false for OutOfMemoryMode::EmitEarly, there was in effect still greedy memory reservation despite the use of FairSpillPool. That was causing one of the streams to be unable to reserve even a minuscule amount of memory.

Additionally, the 'emit early' code would only kick in if there was at least one full batch of group values to emit. I've modified the code to still try to do that if it can, but to fall back to emitting a smaller batch if memory pressure requires it.
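
A minimal sketch of that fallback logic, assuming stand-in names (num_groups, batch_size, and under_memory_pressure represent the stream's internal state, not actual fields):

/// Illustrative only: decide how many group values to emit early.
fn rows_to_emit(num_groups: usize, batch_size: usize, under_memory_pressure: bool) -> usize {
    if num_groups >= batch_size {
        // Normal case: a full batch of group values is ready.
        batch_size
    } else if under_memory_pressure {
        // The fallback described above: emit a smaller batch rather than fail.
        num_groups
    } else {
        // Not enough groups yet and no pressure: keep accumulating.
        0
    }
}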

@pepijnve (Contributor Author)

One remaining question I have is how skip_aggregation_probe and group_values_soft_limit are expected to interact with the memory reservation system.

In SkipAggregationProbe::update_state the total number of input rows is accumulated, but only the current number of group values is compared against it. If disk spilling or early emission is in effect, the consequence is that the ratio will only decrease, since the denominator of the ratio keeps increasing. In other words, it's unlikely to ever kick in.

The same goes for group_values_soft_limit. It's compared against the number of group values that are held in memory by the aggregation stream itself. If we keep flushing values either down the pipeline or to disk, it's also unlikely to ever trigger.
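
To make the concern concrete, a hedged sketch of the ratio check being described (the parameter names are stand-ins, not the actual SkipAggregationProbe fields):

/// Illustrative only: the probe fires when the ratio of in-memory groups to
/// total input rows exceeds a threshold. With spilling or early emission,
/// `groups_in_memory` stays bounded while `total_input_rows` keeps growing,
/// so the ratio only trends downward and the probe is unlikely to ever fire.
fn should_skip_partial_aggregation(
    groups_in_memory: usize,
    total_input_rows: usize,
    ratio_threshold: f64,
) -> bool {
    if total_input_rows == 0 {
        return false;
    }
    (groups_in_memory as f64 / total_input_rows as f64) >= ratio_threshold
}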

@kazuyukitanimura (Contributor) left a comment

Thank you @pepijnve

Comment on lines 1170 to 1173
/// At this point, all the inputs are read and there are some spills.
/// Emit the remaining rows and create a batch.
/// Conduct a streaming merge sort between the batch and spilled data. Since the stream is fully
/// sorted, set `self.group_ordering` to Full, then later we can read with [`EmitTo::First`].
Contributor:

Should we update this comment because we now always spill the remaining rows?

Also, is there any way to measure the impact of this change in a micro benchmark?

@pepijnve (Contributor Author) commented Dec 19, 2025

Good catch. Yes, I'll revise that comment a bit.

I've made some adjustments that allow benchmarks/src/bin/external_aggr.rs to run to completion again (it was broken on main). I haven't yet inspected the query plans it uses to see which code path it's actually covering.

Because it wasn't working on main, it's not possible to use it for comparative testing.

@pepijnve (Contributor Author) commented Dec 19, 2025

I decided to inline update_merged_stream into set_input_done_and_produce_output. I couldn't come up with a sensible name for it, and found it to be clearer if the two alternatives in set_input_done_and_produce_output are close together.

I've added additional comments in set_input_done_and_produce_output to explain what's being done there and why.

@pepijnve (Contributor Author) commented Dec 19, 2025

In external_aggr.rs, the query plan for Q1 is

ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
  AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
    CoalescePartitionsExec
      AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
        ProjectionExec: expr=[]
          AggregateExec: mode=FinalPartitioned, gby=[l_orderkey@0 as l_orderkey, l_suppkey@1 as l_suppkey], aggr=[]
            RepartitionExec: partitioning=Hash([l_orderkey@0, l_suppkey@1], 4), input_partitions=4
              AggregateExec: mode=Partial, gby=[l_orderkey@0 as l_orderkey, l_suppkey@1 as l_suppkey], aggr=[]
                DataSourceExec: ...

I've confirmed with the debugger that a mix of OutOfMemoryMode::EmitEarly and OutOfMemoryMode::Spill is used. So that benchmark is definitely useful to measure the performance impact of changes in this area of the code.

@alamb Is it worth trying to get it into a working state again on main with only minimal changes first? My appetite for that work is rather limited unless it's considered a blocker for this PR.

Contributor:

> @alamb Is it worth trying to get it into a working state again on main with only minimal changes first? My appetite for that work is rather limited unless it's considered a blocker for this PR.

I don't think so. If we hit issues we can always make a new PR

@alamb (Contributor) commented Dec 19, 2025

I think this one is ready to go. I'll just update it once more and run benchmarks one more time to make me feel better, then plan to merge it in.

@alamb (Contributor) commented Dec 19, 2025

run benchmark clickbench_partitioned

@apache apache deleted a comment from alamb-ghbot Dec 19, 2025
@alamb-ghbot

🤖 ./gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_19286 (2ea0e3d) to ead8209 diff using: clickbench_partitioned
Results will be posted here when complete

@alamb (Contributor) commented Dec 19, 2025

run benchmark tpch

@alamb-ghbot

🤖: Benchmark completed


Comparing HEAD and issue_19286
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_19286 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.29 ms │     2.54 ms │  1.11x slower │
│ QQuery 1     │    49.15 ms │    49.71 ms │     no change │
│ QQuery 2     │   135.30 ms │   131.44 ms │     no change │
│ QQuery 3     │   158.94 ms │   152.10 ms │     no change │
│ QQuery 4     │  1077.08 ms │  1094.73 ms │     no change │
│ QQuery 5     │  1504.96 ms │  1499.92 ms │     no change │
│ QQuery 6     │     2.17 ms │     2.10 ms │     no change │
│ QQuery 7     │    55.89 ms │    55.15 ms │     no change │
│ QQuery 8     │  1444.52 ms │  1446.30 ms │     no change │
│ QQuery 9     │  1867.87 ms │  1866.52 ms │     no change │
│ QQuery 10    │   362.73 ms │   362.69 ms │     no change │
│ QQuery 11    │   415.56 ms │   423.89 ms │     no change │
│ QQuery 12    │  1348.89 ms │  1347.85 ms │     no change │
│ QQuery 13    │  2007.19 ms │  2053.80 ms │     no change │
│ QQuery 14    │  1286.53 ms │  1277.48 ms │     no change │
│ QQuery 15    │  1266.86 ms │  1283.56 ms │     no change │
│ QQuery 16    │  2680.37 ms │  2685.34 ms │     no change │
│ QQuery 17    │  2667.04 ms │  2684.80 ms │     no change │
│ QQuery 18    │  5779.79 ms │  4956.34 ms │ +1.17x faster │
│ QQuery 19    │   123.39 ms │   122.35 ms │     no change │
│ QQuery 20    │  1979.79 ms │  1893.95 ms │     no change │
│ QQuery 21    │  2249.46 ms │  2178.91 ms │     no change │
│ QQuery 22    │  3832.69 ms │  3714.92 ms │     no change │
│ QQuery 23    │ 18395.94 ms │ 12136.24 ms │ +1.52x faster │
│ QQuery 24    │   215.60 ms │   212.90 ms │     no change │
│ QQuery 25    │   463.47 ms │   452.82 ms │     no change │
│ QQuery 26    │   227.38 ms │   226.94 ms │     no change │
│ QQuery 27    │  2828.96 ms │  2760.40 ms │     no change │
│ QQuery 28    │ 25164.08 ms │ 21938.93 ms │ +1.15x faster │
│ QQuery 29    │   964.94 ms │   974.72 ms │     no change │
│ QQuery 30    │  1364.88 ms │  1316.22 ms │     no change │
│ QQuery 31    │  1356.24 ms │  1322.71 ms │     no change │
│ QQuery 32    │  5098.17 ms │  5120.02 ms │     no change │
│ QQuery 33    │  6036.11 ms │  5869.18 ms │     no change │
│ QQuery 34    │  6409.47 ms │  6356.96 ms │     no change │
│ QQuery 35    │  1986.02 ms │  1939.25 ms │     no change │
│ QQuery 36    │    66.26 ms │    70.39 ms │  1.06x slower │
│ QQuery 37    │    45.43 ms │    44.82 ms │     no change │
│ QQuery 38    │    67.59 ms │    66.48 ms │     no change │
│ QQuery 39    │   105.42 ms │   100.53 ms │     no change │
│ QQuery 40    │    27.78 ms │    26.24 ms │ +1.06x faster │
│ QQuery 41    │    24.04 ms │    23.77 ms │     no change │
│ QQuery 42    │    21.02 ms │    19.77 ms │ +1.06x faster │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 103167.27ms │
│ Total Time (issue_19286)   │  92265.70ms │
│ Average Time (HEAD)        │   2399.24ms │
│ Average Time (issue_19286) │   2145.71ms │
│ Queries Faster             │           5 │
│ Queries Slower             │           2 │
│ Queries with No Change     │          36 │
│ Queries with Failure       │           0 │
└────────────────────────────┴─────────────┘

@alamb-ghbot

🤖 ./gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_19286 (2ea0e3d) to ead8209 diff using: tpch
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed


Comparing HEAD and issue_19286
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ issue_19286 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 221.41 ms │   219.18 ms │ no change │
│ QQuery 2     │  97.81 ms │    96.65 ms │ no change │
│ QQuery 3     │ 130.09 ms │   125.76 ms │ no change │
│ QQuery 4     │  77.41 ms │    75.86 ms │ no change │
│ QQuery 5     │ 178.62 ms │   171.12 ms │ no change │
│ QQuery 6     │  64.82 ms │    67.91 ms │ no change │
│ QQuery 7     │ 223.95 ms │   220.97 ms │ no change │
│ QQuery 8     │ 164.51 ms │   166.09 ms │ no change │
│ QQuery 9     │ 225.75 ms │   228.03 ms │ no change │
│ QQuery 10    │ 190.29 ms │   187.33 ms │ no change │
│ QQuery 11    │  75.81 ms │    79.26 ms │ no change │
│ QQuery 12    │ 117.43 ms │   118.47 ms │ no change │
│ QQuery 13    │ 220.44 ms │   226.54 ms │ no change │
│ QQuery 14    │  93.13 ms │    90.60 ms │ no change │
│ QQuery 15    │ 124.64 ms │   122.01 ms │ no change │
│ QQuery 16    │  58.59 ms │    59.09 ms │ no change │
│ QQuery 17    │ 279.57 ms │   278.04 ms │ no change │
│ QQuery 18    │ 319.34 ms │   322.98 ms │ no change │
│ QQuery 19    │ 134.92 ms │   137.04 ms │ no change │
│ QQuery 20    │ 126.04 ms │   125.36 ms │ no change │
│ QQuery 21    │ 267.19 ms │   261.86 ms │ no change │
│ QQuery 22    │  43.41 ms │    44.01 ms │ no change │
└──────────────┴───────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 3435.15ms │
│ Total Time (issue_19286)   │ 3424.14ms │
│ Average Time (HEAD)        │  156.14ms │
│ Average Time (issue_19286) │  155.64ms │
│ Queries Faster             │         0 │
│ Queries Slower             │         0 │
│ Queries with No Change     │        22 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

@alamb alamb added this pull request to the merge queue Dec 20, 2025
@alamb (Contributor) commented Dec 20, 2025

Thanks again @pepijnve and @kazuyukitanimura

Merged via the queue into apache:main with commit eb30c19 Dec 20, 2025
31 checks passed

Labels

physical-expr (Changes to the physical-expr crates), physical-plan (Changes to the physical-plan crate)

Development

Successfully merging this pull request may close these issues:

Grouped aggregations with many distinct groups do not respect memory limit when input is sorted