Fix panic for `GROUPING SETS(())` and handle empty-grouping aggregates #19252

kosiew · 2025-12-10T09:28:25Z

Which issue does this PR close?

Closes Invalid GROUPING SETS caused panic #18974.

Rationale for this change

The DataFusion CLI currently panics with an "index out of bounds" error when executing queries that use GROUP BY GROUPING SETS(()), such as:

SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())

This panic originates in the physical aggregation code, which assumes that an empty list of grouping expressions always corresponds to "no grouping". That assumption breaks down in the presence of GROUPING SETS, where an empty set is a valid grouping set that should still produce a result row (and __grouping_id) rather than crashing.

This PR fixes the panic by explicitly distinguishing:

true "no GROUP BY" aggregations, and
GROUPING SETS/CUBE/ROLLUP plans that may have empty grouping expressions but still require grouping-set semantics and a valid __grouping_id.

The change restores robustness of the CLI and ensures standards-compliant behavior for grouping sets with empty sets.

What changes are included in this PR?

Summary of the main changes:

Track grouping-set usage explicitly in PhysicalGroupBy:
- Add a has_grouping_set: bool field to PhysicalGroupBy.
- Extend PhysicalGroupBy::new to accept the has_grouping_set flag.
- Add helper methods:
  - has_grouping_set(&self) -> bool to expose the flag, and
  - is_true_no_grouping(&self) -> bool to represent the case of genuinely no grouping (no GROUP BY and no grouping sets).
Correct group state construction for empty grouping with grouping sets:
- Update PhysicalGroupBy::from_pre_group so that it only treats expr.is_empty() as "no groups" when has_grouping_set is false.
- For GROUPING SETS(()), we now build at least one group, avoiding the previous out-of-bounds access on groups[0].
Clarify when __grouping_id should be present:
- Replace the previous is_single logic with a clearer distinction based on has_grouping_set.
- num_output_exprs, output_exprs, num_group_exprs, and group_schema now add the __grouping_id column only when has_grouping_set is true.
- is_single is redefined as "simple GROUP BY" (no grouping sets), i.e. !self.has_grouping_set.
Integrate the new semantics into AggregateExec:
- Use group_by.is_true_no_grouping() instead of group_by.expr.is_empty() when choosing between the specialized no-grouping aggregation path and grouped aggregation.
- Ensure that is_unordered_unfiltered_group_by_distinct only treats plans as grouped when there are grouping expressions and no grouping sets (!has_grouping_set).
- Preserve existing behavior for regular GROUP BY while correctly handling GROUPING SETS and related constructs.
Support __grouping_id with the no-grouping aggregation stream:
- Extend AggregateStreamInner with an optional grouping_id: Option<ScalarValue> field.
- Change AggregateStream::new to accept a grouping_id argument.
- Introduce prepend_grouping_id_column to prepend a __grouping_id column to the finalized accumulator output when needed.
- Wire this up so that no-grouping aggregations can still match a schema that includes __grouping_id in grouping-set scenarios.
Planner and execution wiring updates:
- Update all PhysicalGroupBy::new call sites to pass the correct has_grouping_set value:
  - false for:
    - ordinary GROUP BY or truly no-grouping aggregates.
  - true for:
    - GROUPING SETS,
    - CUBE, and
    - ROLLUP physical planning paths.
- Ensure merge_grouping_set_physical_expr, create_cube_physical_expr, and create_rollup_physical_expr correctly mark grouping-set plans.
Protobuf / physical plan round-trip support:
- Extend AggregateExecNode in datafusion.proto with a new bool has_grouping_set = 12; field.
- Update the generated pbjson and prost code to serialize and deserialize the new field.
- When constructing AggregateExec from protobuf, pass the decoded has_grouping_set into PhysicalGroupBy::new.
- When serializing an AggregateExec back to protobuf, set has_grouping_set based on exec.group_expr().has_grouping_set().
- Update round-trip physical plan tests to include the new field in their expectations.
Tests and SQL logic coverage:
- Add sqllogictests for the previously failing cases in grouping.slt:
  - SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (());
  - SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(()) (the original panic case).
- Extend or adjust unit tests in aggregates, physical_planner, filter_pushdown, and coop modules to account for the has_grouping_set flag in PhysicalGroupBy and expected debug output.
- Update proto round-trip tests to validate has_grouping_set is preserved.

Are these changes tested?

Yes.

New sqllogictests covering GROUPING SETS(()) for both a regular table and generate_series(10):
- grouping.slt now asserts the expected scalar results (e.g. 2 and 55), preventing regressions on this edge case.
Updated and existing Rust unit tests:
- physical-plan/src/aggregates tests updated to include has_grouping_set in PhysicalGroupBy expectations.
- Planner and optimizer tests (e.g. physical_planner.rs, filter_pushdown) updated to construct PhysicalGroupBy with the new flag.
- Execution tests in core/tests/execution/coop.rs updated to reflect the new constructor and continue to exercise the no-grouping aggregation path.
- Protobuf round-trip tests extended to verify that has_grouping_set is correctly serialized and deserialized.

These tests collectively ensure that:

the panic is fixed,
the aggregation semantics for GROUPING SETS(()) are correct, and
existing aggregate behavior remains unchanged for non-grouping-set queries.

Are there any user-facing changes?

Yes, but they are bug fixes and behavior clarifications rather than breaking changes:

Queries using GROUP BY GROUPING SETS(()) no longer cause a runtime panic in the DataFusion CLI.
- Instead, they return the expected single aggregate row (e.g. COUNT(*) or SUM(v1)), consistent with SQL semantics.
For plans using GROUPING SETS, CUBE, or ROLLUP, the internal __grouping_id column is now present consistently whenever grouping sets are in use, even when the grouping expressions are empty.
For ordinary GROUP BY queries that do not use grouping sets, behavior is unchanged: no unexpected __grouping_id column is added.

No API signatures were changed in a breaking way for downstream users; the additions are internal flags and protobuf fields to accurately represent the physical plan.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Implement grouping ID defaults and final-stage handling for grouping sets, ensuring they expose __grouping_id in schemas, even without explicit columns. Modify no-grouping aggregate streams to append the grouping ID column as necessary, maintaining batch output alignment with grouping-set schemas. Propagate the new grouping-set flag through the physical planner and protobuf serialization, and add a regression test for GROUPING SETS(()).

Introduce has_grouping_set flag in PhysicalGroupBy to ensure grouping ID columns are generated and grouping-set execution paths are honored when no grouping expressions are present. Propagate the grouping-set flag through physical planning and protobuf serialization to maintain __grouping_id metadata. Add a regression test for GROUPING SETS(()) to verify aggregate schema and execution count.

…Stream

… grouping ID usage

…ify grouping ID usage

… in AggregateStream

…hanges

… unnecessary line

ethan-tyler

Thanks for working on this @kosiew! The approach of tracking has_grouping_set explicitly LGTM.

Found one issue with the statistics path that I think needs fixing, and had a question
about some of the new code in no_grouping.rs.

Tested locally, SQL logic tests and aggregate tests pass.

ethan-tyler · 2025-12-10T19:01:08Z

datafusion/physical-plan/src/aggregates/mod.rs

        match self.mode {
            AggregateMode::Final | AggregateMode::FinalPartitioned
-                if self.group_by.expr.is_empty() =>
+                if self.group_by.is_true_no_grouping() =>


This should probably be self.group_by.expr.is_empty() instead.

GROUPING SETS(()) still produces exactly one row (it's aggregating everything),
but is_true_no_grouping() returns false for it since has_grouping_set is true.
So we'd fall through to the inexact branch and report the wrong cardinality.

Suggested change

if self.group_by.is_true_no_grouping() =>

if self.group_by.expr.is_empty() =>

Verified the fix - expr.is_empty() is correct here since GROUPING SETS(())
produces one row regardless of has_grouping_set.

ethan-tyler · 2025-12-10T19:28:58Z

datafusion/physical-plan/src/aggregates/no_grouping.rs

        agg: &AggregateExec,
        context: &Arc<TaskContext>,
        partition: usize,
+        grouping_id: Option<ScalarValue>,


Looks like this is always None at the call site mod.rs:742 , is this scaffolding for
something planned, or can we drop it? If it's for future use, maybe add a
comment so nobody removes it thinking it's dead code.

I'll remove it

Parameter's gone from the signature, but prepend_grouping_id_column is still
being called with hardcoded None, which makes the whole .and_then() block
a no-op.

Not blocking, but could remove the function and that call entirely since
GROUPING SETS(()) goes through the grouped path anyway. Can be a follow-up
cleanup if you'd rather not touch it now.

ethan-tyler · 2025-12-10T19:30:04Z

datafusion/sqllogictest/test_files/grouping.slt

+
+# grouping_sets_with_empty_set
+query I
+SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (());
+----
+2
+
+# grouping_sets_with_empty_set
+query I
+SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())
+----
+55


This is good to have both COUNT and SUM tested here.

…xpr.is_empty()

ethan-tyler

Verified the changes:

Statistics cardinality fix correct
has_grouping_set flag properly wired through PhysicalGroupBy, planner,
and protobuf serialization
is_true_no_grouping() correctly distinguishes "no GROUP BY" from
"GROUPING SETS with empty expressions"
Test coverage solid - both COUNT and SUM cases for GROUPING SETS(())

One minor thing: prepend_grouping_id_column is now dead code since it's
always called with None. Not blocking - can clean up in a follow-up.

LGTM, nice fix.

kosiew · 2025-12-15T06:32:16Z

@ethan-tyler
Thanks for your review and feedback.

…pache#19252)

adriangb

Thanks @kosiew !

adriangb · 2025-12-15T14:29:33Z

datafusion/core/src/physical_planner.rs

        } else if group_expr.is_empty() {
            // No GROUP BY clause - create empty PhysicalGroupBy
-            Ok(PhysicalGroupBy::new(vec![], vec![], vec![]))
+            Ok(PhysicalGroupBy::new(vec![], vec![], vec![], false))


This constructor is now a bit hard to grok - could you decorate this call sight with a comment such as --no expressions, no null expressions and no grouping expressions

I'll add comments to clarify this.

adriangb · 2025-12-15T14:31:03Z

datafusion/physical-plan/src/aggregates/mod.rs

    /// Null mask for each group in this grouping set. Each group is
    /// composed of either one of the group expressions in expr or a null
    /// expression in null_expr. If `groups[i][j]` is true, then the
    /// j-th expression in the i-th group is NULL, otherwise it is `expr[j]`.
    groups: Vec<Vec<bool>>,
+    /// True when GROUPING SETS/CUBE/ROLLUP are used so `__grouping_id` should
+    /// be included in the output schema.
+    has_grouping_set: bool,


Would it be better to make groups: Option<Vec<Vec<bool>>> and give None the meaning of has_grouping_set: false and Some(vec![]) of true?

Thanks for the suggestion.

I think keeping a separate has_grouping_set: bool is actually the cleaner and more approachable design here, for a few reasons.

Clear intent
The flag says exactly what it means: should __grouping_id be part of the output schema? That intent is immediately obvious when reading the code, instead of having to infer behavior from the state of an Option.

Easier to reason about
With an explicit boolean, the invariant is simple:

true → grouping-set mode

false → simple GROUP BY

Using an Option would require interpreting things like Some(vec![]) vs Some(vec![vec![false]]).

More readable in practice
Checks like:

if self.has_grouping_set { ... }

are easier to scan and understand than repeatedly pattern-matching on an Option

Fits the existing logic nicely
The current from_pre_group() logic already reads very clearly:

let groups = if self.expr.is_empty() && !self.has_grouping_set { // No GROUP BY expressions - should have no groups vec![] } else { vec![vec![false; num_exprs]] };

With an Option-based approach, this becomes harder to follow and less explicit about why we’re making this distinction.

Plays well with protobufs
A simple bool maps cleanly to the proto definition and avoids extra conversion or interpretation logic between Rust Options and protobuf semantics.

Separation of concerns
groups describes the structure of grouping (which expressions are used where), while has_grouping_set independently captures whether grouping-set mode is enabled at all (which affects schema and grouping ID generation).

…sicalPlanner

kosiew · 2025-12-16T03:56:02Z

@adriangb
Thanks for your review and feedback.

kosiew added 8 commits December 10, 2025 17:21

feat: Enhance grouping functionality in PhysicalGroupBy and Aggregate…

506cb03

…Stream

feat: Improve documentation for AggregateStream constructor regarding…

0c4edb7

… grouping ID usage

refactor: Simplify AggregateStream constructor documentation and clar…

dabb3eb

…ify grouping ID usage

feat: Add function to prepend grouping ID column for schema alignment…

50dc435

… in AggregateStream

feat: Move grouping_sets test file to slt, update mod.rs to reflect c…

70f77b0

…hanges

feat: Add test case for grouping sets with an empty set in grouping.slt

88c4390

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) proto Related to proto crate physical-plan Changes to the physical-plan crate labels Dec 10, 2025

kosiew added 2 commits December 10, 2025 20:01

clippy fix

f6f988d

fix: Simplify deserialization logic for AggregateExecNode by removing…

61260cd

… unnecessary line

kosiew marked this pull request as ready for review December 10, 2025 15:30

ethan-tyler reviewed Dec 10, 2025

View reviewed changes

kosiew added 2 commits December 11, 2025 11:30

fix: Update condition for no grouping check in AggregateExec to use e…

87de90b

…xpr.is_empty()

fix: Remove unused grouping_id parameter from AggregateStream::new

890cdf0

ethan-tyler approved these changes Dec 12, 2025

View reviewed changes

kosiew requested a review from 2010YOUY01 December 15, 2025 06:32

adriangb enabled auto-merge December 15, 2025 14:18

adriangb disabled auto-merge December 15, 2025 14:19

adriangb added a commit to pydantic/datafusion that referenced this pull request Dec 15, 2025

Fix panic for GROUPING SETS(()) and handle empty-grouping aggregates (a…

4f7dd0e

…pache#19252)

adriangb approved these changes Dec 15, 2025

View reviewed changes

fix: Clarify comment for empty PhysicalGroupBy creation in DefaultPhy…

b3e7654

…sicalPlanner

kosiew added this pull request to the merge queue Dec 16, 2025

Merged via the queue into apache:main with commit c53a448 Dec 16, 2025
31 checks passed

	if self.group_by.is_true_no_grouping() =>
	if self.group_by.expr.is_empty() =>

Fix panic for GROUPING SETS(()) and handle empty-grouping aggregates #19252

Fix panic for GROUPING SETS(()) and handle empty-grouping aggregates #19252

Uh oh!

Conversation

kosiew commented Dec 10, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

LLM-generated code disclosure

Uh oh!

ethan-tyler left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ethan-tyler left a comment

Choose a reason for hiding this comment

Uh oh!

kosiew commented Dec 15, 2025

Uh oh!

adriangb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kosiew commented Dec 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix panic for `GROUPING SETS(())` and handle empty-grouping aggregates #19252

Fix panic for `GROUPING SETS(())` and handle empty-grouping aggregates #19252

ethan-tyler left a comment •

edited

Loading