Skip to content

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Dec 10, 2025

Which issue does this PR close?

Rationale for this change

The DataFusion CLI currently panics with an "index out of bounds" error when executing queries that use GROUP BY GROUPING SETS(()), such as:

SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())

This panic originates in the physical aggregation code, which assumes that an empty list of grouping expressions always corresponds to "no grouping". That assumption breaks down in the presence of GROUPING SETS, where an empty set is a valid grouping set that should still produce a result row (and __grouping_id) rather than crashing.

This PR fixes the panic by explicitly distinguishing:

  • true "no GROUP BY" aggregations, and
  • GROUPING SETS/CUBE/ROLLUP plans that may have empty grouping expressions but still require grouping-set semantics and a valid __grouping_id.

The change restores robustness of the CLI and ensures standards-compliant behavior for grouping sets with empty sets.

What changes are included in this PR?

Summary of the main changes:

  • Track grouping-set usage explicitly in PhysicalGroupBy:

    • Add a has_grouping_set: bool field to PhysicalGroupBy.

    • Extend PhysicalGroupBy::new to accept the has_grouping_set flag.

    • Add helper methods:

      • has_grouping_set(&self) -> bool to expose the flag, and
      • is_true_no_grouping(&self) -> bool to represent the case of genuinely no grouping (no GROUP BY and no grouping sets).
  • Correct group state construction for empty grouping with grouping sets:

    • Update PhysicalGroupBy::from_pre_group so that it only treats expr.is_empty() as "no groups" when has_grouping_set is false.
    • For GROUPING SETS(()), we now build at least one group, avoiding the previous out-of-bounds access on groups[0].
  • Clarify when __grouping_id should be present:

    • Replace the previous is_single logic with a clearer distinction based on has_grouping_set.
    • num_output_exprs, output_exprs, num_group_exprs, and group_schema now add the __grouping_id column only when has_grouping_set is true.
    • is_single is redefined as "simple GROUP BY" (no grouping sets), i.e. !self.has_grouping_set.
  • Integrate the new semantics into AggregateExec:

    • Use group_by.is_true_no_grouping() instead of group_by.expr.is_empty() when choosing between the specialized no-grouping aggregation path and grouped aggregation.
    • Ensure that is_unordered_unfiltered_group_by_distinct only treats plans as grouped when there are grouping expressions and no grouping sets (!has_grouping_set).
    • Preserve existing behavior for regular GROUP BY while correctly handling GROUPING SETS and related constructs.
  • Support __grouping_id with the no-grouping aggregation stream:

    • Extend AggregateStreamInner with an optional grouping_id: Option<ScalarValue> field.
    • Change AggregateStream::new to accept a grouping_id argument.
    • Introduce prepend_grouping_id_column to prepend a __grouping_id column to the finalized accumulator output when needed.
    • Wire this up so that no-grouping aggregations can still match a schema that includes __grouping_id in grouping-set scenarios.
  • Planner and execution wiring updates:

    • Update all PhysicalGroupBy::new call sites to pass the correct has_grouping_set value:

      • false for:

        • ordinary GROUP BY or truly no-grouping aggregates.
      • true for:

        • GROUPING SETS,
        • CUBE, and
        • ROLLUP physical planning paths.
    • Ensure merge_grouping_set_physical_expr, create_cube_physical_expr, and create_rollup_physical_expr correctly mark grouping-set plans.

  • Protobuf / physical plan round-trip support:

    • Extend AggregateExecNode in datafusion.proto with a new bool has_grouping_set = 12; field.
    • Update the generated pbjson and prost code to serialize and deserialize the new field.
    • When constructing AggregateExec from protobuf, pass the decoded has_grouping_set into PhysicalGroupBy::new.
    • When serializing an AggregateExec back to protobuf, set has_grouping_set based on exec.group_expr().has_grouping_set().
    • Update round-trip physical plan tests to include the new field in their expectations.
  • Tests and SQL logic coverage:

    • Add sqllogictests for the previously failing cases in grouping.slt:

      • SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (());
      • SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(()) (the original panic case).
    • Extend or adjust unit tests in aggregates, physical_planner, filter_pushdown, and coop modules to account for the has_grouping_set flag in PhysicalGroupBy and expected debug output.

    • Update proto round-trip tests to validate has_grouping_set is preserved.

Are these changes tested?

Yes.

  • New sqllogictests covering GROUPING SETS(()) for both a regular table and generate_series(10):

    • grouping.slt now asserts the expected scalar results (e.g. 2 and 55), preventing regressions on this edge case.
  • Updated and existing Rust unit tests:

    • physical-plan/src/aggregates tests updated to include has_grouping_set in PhysicalGroupBy expectations.
    • Planner and optimizer tests (e.g. physical_planner.rs, filter_pushdown) updated to construct PhysicalGroupBy with the new flag.
    • Execution tests in core/tests/execution/coop.rs updated to reflect the new constructor and continue to exercise the no-grouping aggregation path.
    • Protobuf round-trip tests extended to verify that has_grouping_set is correctly serialized and deserialized.

These tests collectively ensure that:

  • the panic is fixed,
  • the aggregation semantics for GROUPING SETS(()) are correct, and
  • existing aggregate behavior remains unchanged for non-grouping-set queries.

Are there any user-facing changes?

Yes, but they are bug fixes and behavior clarifications rather than breaking changes:

  • Queries using GROUP BY GROUPING SETS(()) no longer cause a runtime panic in the DataFusion CLI.

    • Instead, they return the expected single aggregate row (e.g. COUNT(*) or SUM(v1)), consistent with SQL semantics.
  • For plans using GROUPING SETS, CUBE, or ROLLUP, the internal __grouping_id column is now present consistently whenever grouping sets are in use, even when the grouping expressions are empty.

  • For ordinary GROUP BY queries that do not use grouping sets, behavior is unchanged: no unexpected __grouping_id column is added.

No API signatures were changed in a breaking way for downstream users; the additions are internal flags and protobuf fields to accurately represent the physical plan.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Implement grouping ID defaults and final-stage handling for
grouping sets, ensuring they expose __grouping_id in
schemas, even without explicit columns. Modify no-grouping
aggregate streams to append the grouping ID column as
necessary, maintaining batch output alignment with
grouping-set schemas. Propagate the new grouping-set flag
through the physical planner and protobuf serialization,
and add a regression test for GROUPING SETS(()).
Introduce has_grouping_set flag in PhysicalGroupBy to ensure
grouping ID columns are generated and grouping-set execution paths
are honored when no grouping expressions are present. Propagate the
grouping-set flag through physical planning and protobuf serialization
to maintain __grouping_id metadata. Add a regression test for
GROUPING SETS(()) to verify aggregate schema and execution count.
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) proto Related to proto crate physical-plan Changes to the physical-plan crate labels Dec 10, 2025
@kosiew kosiew marked this pull request as ready for review December 10, 2025 15:30
Copy link

@ethan-tyler ethan-tyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @kosiew! The approach of tracking has_grouping_set explicitly LGTM.

Found one issue with the statistics path that I think needs fixing, and had a question
about some of the new code in no_grouping.rs.

Tested locally, SQL logic tests and aggregate tests pass.

match self.mode {
AggregateMode::Final | AggregateMode::FinalPartitioned
if self.group_by.expr.is_empty() =>
if self.group_by.is_true_no_grouping() =>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be self.group_by.expr.is_empty() instead.

GROUPING SETS(()) still produces exactly one row (it's aggregating everything),
but is_true_no_grouping() returns false for it since has_grouping_set is true.
So we'd fall through to the inexact branch and report the wrong cardinality.

Suggested change
if self.group_by.is_true_no_grouping() =>
if self.group_by.expr.is_empty() =>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amended.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the fix - expr.is_empty() is correct here since GROUPING SETS(())
produces one row regardless of has_grouping_set.

agg: &AggregateExec,
context: &Arc<TaskContext>,
partition: usize,
grouping_id: Option<ScalarValue>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is always None at the call site mod.rs:742 , is this scaffolding for
something planned, or can we drop it? If it's for future use, maybe add a
comment so nobody removes it thinking it's dead code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameter's gone from the signature, but prepend_grouping_id_column is still
being called with hardcoded None, which makes the whole .and_then() block
a no-op.

Not blocking, but could remove the function and that call entirely since
GROUPING SETS(()) goes through the grouped path anyway. Can be a follow-up
cleanup if you'd rather not touch it now.

Comment on lines +215 to +226

# grouping_sets_with_empty_set
query I
SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (());
----
2

# grouping_sets_with_empty_set
query I
SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())
----
55

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to have both COUNT and SUM tested here.

Copy link

@ethan-tyler ethan-tyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the changes:

  1. Statistics cardinality fix correct
  2. has_grouping_set flag properly wired through PhysicalGroupBy, planner,
    and protobuf serialization
  3. is_true_no_grouping() correctly distinguishes "no GROUP BY" from
    "GROUPING SETS with empty expressions"
  4. Test coverage solid - both COUNT and SUM cases for GROUPING SETS(())

One minor thing: prepend_grouping_id_column is now dead code since it's
always called with None. Not blocking - can clean up in a follow-up.

LGTM, nice fix.

@kosiew
Copy link
Contributor Author

kosiew commented Dec 15, 2025

@ethan-tyler
Thanks for your review and feedback.

@kosiew kosiew requested a review from 2010YOUY01 December 15, 2025 06:32
@adriangb adriangb enabled auto-merge December 15, 2025 14:18
@adriangb adriangb disabled auto-merge December 15, 2025 14:19
adriangb added a commit to pydantic/datafusion that referenced this pull request Dec 15, 2025
Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kosiew !

} else if group_expr.is_empty() {
// No GROUP BY clause - create empty PhysicalGroupBy
Ok(PhysicalGroupBy::new(vec![], vec![], vec![]))
Ok(PhysicalGroupBy::new(vec![], vec![], vec![], false))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor is now a bit hard to grok - could you decorate this call sight with a comment such as --no expressions, no null expressions and no grouping expressions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add comments to clarify this.

Comment on lines +175 to +182
/// Null mask for each group in this grouping set. Each group is
/// composed of either one of the group expressions in expr or a null
/// expression in null_expr. If `groups[i][j]` is true, then the
/// j-th expression in the i-th group is NULL, otherwise it is `expr[j]`.
groups: Vec<Vec<bool>>,
/// True when GROUPING SETS/CUBE/ROLLUP are used so `__grouping_id` should
/// be included in the output schema.
has_grouping_set: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to make groups: Option<Vec<Vec<bool>>> and give None the meaning of has_grouping_set: false and Some(vec![]) of true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion.

I think keeping a separate has_grouping_set: bool is actually the cleaner and more approachable design here, for a few reasons.

Clear intent
The flag says exactly what it means: should __grouping_id be part of the output schema? That intent is immediately obvious when reading the code, instead of having to infer behavior from the state of an Option.

Easier to reason about
With an explicit boolean, the invariant is simple:

  • true → grouping-set mode
  • false → simple GROUP BY

Using an Option would require interpreting things like Some(vec![]) vs Some(vec![vec![false]]).

More readable in practice
Checks like:

if self.has_grouping_set { ... }

are easier to scan and understand than repeatedly pattern-matching on an Option

Fits the existing logic nicely
The current from_pre_group() logic already reads very clearly:

let groups = if self.expr.is_empty() && !self.has_grouping_set {
    // No GROUP BY expressions - should have no groups
    vec![]
} else {
    vec![vec![false; num_exprs]]
};

With an Option-based approach, this becomes harder to follow and less explicit about why we’re making this distinction.

Plays well with protobufs
A simple bool maps cleanly to the proto definition and avoids extra conversion or interpretation logic between Rust Options and protobuf semantics.

Separation of concerns
groups describes the structure of grouping (which expressions are used where), while has_grouping_set independently captures whether grouping-set mode is enabled at all (which affects schema and grouping ID generation).

@kosiew
Copy link
Contributor Author

kosiew commented Dec 16, 2025

@adriangb
Thanks for your review and feedback.

@kosiew kosiew added this pull request to the merge queue Dec 16, 2025
Merged via the queue into apache:main with commit c53a448 Dec 16, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Invalid GROUPING SETS caused panic

3 participants