-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix panic for GROUPING SETS(()) and handle empty-grouping aggregates
#19252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implement grouping ID defaults and final-stage handling for grouping sets, ensuring they expose __grouping_id in schemas, even without explicit columns. Modify no-grouping aggregate streams to append the grouping ID column as necessary, maintaining batch output alignment with grouping-set schemas. Propagate the new grouping-set flag through the physical planner and protobuf serialization, and add a regression test for GROUPING SETS(()).
Introduce has_grouping_set flag in PhysicalGroupBy to ensure grouping ID columns are generated and grouping-set execution paths are honored when no grouping expressions are present. Propagate the grouping-set flag through physical planning and protobuf serialization to maintain __grouping_id metadata. Add a regression test for GROUPING SETS(()) to verify aggregate schema and execution count.
… grouping ID usage
…ify grouping ID usage
… in AggregateStream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @kosiew! The approach of tracking has_grouping_set explicitly LGTM.
Found one issue with the statistics path that I think needs fixing, and had a question
about some of the new code in no_grouping.rs.
Tested locally, SQL logic tests and aggregate tests pass.
| match self.mode { | ||
| AggregateMode::Final | AggregateMode::FinalPartitioned | ||
| if self.group_by.expr.is_empty() => | ||
| if self.group_by.is_true_no_grouping() => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be self.group_by.expr.is_empty() instead.
GROUPING SETS(()) still produces exactly one row (it's aggregating everything),
but is_true_no_grouping() returns false for it since has_grouping_set is true.
So we'd fall through to the inexact branch and report the wrong cardinality.
| if self.group_by.is_true_no_grouping() => | |
| if self.group_by.expr.is_empty() => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified the fix - expr.is_empty() is correct here since GROUPING SETS(())
produces one row regardless of has_grouping_set.
| agg: &AggregateExec, | ||
| context: &Arc<TaskContext>, | ||
| partition: usize, | ||
| grouping_id: Option<ScalarValue>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is always None at the call site mod.rs:742 , is this scaffolding for
something planned, or can we drop it? If it's for future use, maybe add a
comment so nobody removes it thinking it's dead code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parameter's gone from the signature, but prepend_grouping_id_column is still
being called with hardcoded None, which makes the whole .and_then() block
a no-op.
Not blocking, but could remove the function and that call entirely since
GROUPING SETS(()) goes through the grouped path anyway. Can be a follow-up
cleanup if you'd rather not touch it now.
|
|
||
| # grouping_sets_with_empty_set | ||
| query I | ||
| SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (()); | ||
| ---- | ||
| 2 | ||
|
|
||
| # grouping_sets_with_empty_set | ||
| query I | ||
| SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(()) | ||
| ---- | ||
| 55 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good to have both COUNT and SUM tested here.
ethan-tyler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verified the changes:
- Statistics cardinality fix correct
- has_grouping_set flag properly wired through PhysicalGroupBy, planner,
and protobuf serialization - is_true_no_grouping() correctly distinguishes "no GROUP BY" from
"GROUPING SETS with empty expressions" - Test coverage solid - both COUNT and SUM cases for GROUPING SETS(())
One minor thing: prepend_grouping_id_column is now dead code since it's
always called with None. Not blocking - can clean up in a follow-up.
LGTM, nice fix.
|
@ethan-tyler |
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kosiew !
| } else if group_expr.is_empty() { | ||
| // No GROUP BY clause - create empty PhysicalGroupBy | ||
| Ok(PhysicalGroupBy::new(vec![], vec![], vec![])) | ||
| Ok(PhysicalGroupBy::new(vec![], vec![], vec![], false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constructor is now a bit hard to grok - could you decorate this call sight with a comment such as --no expressions, no null expressions and no grouping expressions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add comments to clarify this.
| /// Null mask for each group in this grouping set. Each group is | ||
| /// composed of either one of the group expressions in expr or a null | ||
| /// expression in null_expr. If `groups[i][j]` is true, then the | ||
| /// j-th expression in the i-th group is NULL, otherwise it is `expr[j]`. | ||
| groups: Vec<Vec<bool>>, | ||
| /// True when GROUPING SETS/CUBE/ROLLUP are used so `__grouping_id` should | ||
| /// be included in the output schema. | ||
| has_grouping_set: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to make groups: Option<Vec<Vec<bool>>> and give None the meaning of has_grouping_set: false and Some(vec![]) of true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion.
I think keeping a separate has_grouping_set: bool is actually the cleaner and more approachable design here, for a few reasons.
Clear intent
The flag says exactly what it means: should __grouping_id be part of the output schema? That intent is immediately obvious when reading the code, instead of having to infer behavior from the state of an Option.
Easier to reason about
With an explicit boolean, the invariant is simple:
true→ grouping-set modefalse→ simpleGROUP BY
Using an Option would require interpreting things like Some(vec![]) vs Some(vec![vec![false]]).
More readable in practice
Checks like:
if self.has_grouping_set { ... }are easier to scan and understand than repeatedly pattern-matching on an Option
Fits the existing logic nicely
The current from_pre_group() logic already reads very clearly:
let groups = if self.expr.is_empty() && !self.has_grouping_set {
// No GROUP BY expressions - should have no groups
vec![]
} else {
vec![vec![false; num_exprs]]
};With an Option-based approach, this becomes harder to follow and less explicit about why we’re making this distinction.
Plays well with protobufs
A simple bool maps cleanly to the proto definition and avoids extra conversion or interpretation logic between Rust Options and protobuf semantics.
Separation of concerns
groups describes the structure of grouping (which expressions are used where), while has_grouping_set independently captures whether grouping-set mode is enabled at all (which affects schema and grouping ID generation).
|
@adriangb |
Which issue does this PR close?
GROUPING SETScaused panic #18974.Rationale for this change
The DataFusion CLI currently panics with an "index out of bounds" error when executing queries that use
GROUP BY GROUPING SETS(()), such as:This panic originates in the physical aggregation code, which assumes that an empty list of grouping expressions always corresponds to "no grouping". That assumption breaks down in the presence of
GROUPING SETS, where an empty set is a valid grouping set that should still produce a result row (and__grouping_id) rather than crashing.This PR fixes the panic by explicitly distinguishing:
GROUPING SETS/CUBE/ROLLUPplans that may have empty grouping expressions but still require grouping-set semantics and a valid__grouping_id.The change restores robustness of the CLI and ensures standards-compliant behavior for grouping sets with empty sets.
What changes are included in this PR?
Summary of the main changes:
Track grouping-set usage explicitly in
PhysicalGroupBy:Add a
has_grouping_set: boolfield toPhysicalGroupBy.Extend
PhysicalGroupBy::newto accept thehas_grouping_setflag.Add helper methods:
has_grouping_set(&self) -> boolto expose the flag, andis_true_no_grouping(&self) -> boolto represent the case of genuinely no grouping (no GROUP BY and no grouping sets).Correct group state construction for empty grouping with grouping sets:
PhysicalGroupBy::from_pre_groupso that it only treatsexpr.is_empty()as "no groups" whenhas_grouping_setisfalse.GROUPING SETS(()), we now build at least one group, avoiding the previous out-of-bounds access ongroups[0].Clarify when
__grouping_idshould be present:is_singlelogic with a clearer distinction based onhas_grouping_set.num_output_exprs,output_exprs,num_group_exprs, andgroup_schemanow add the__grouping_idcolumn only whenhas_grouping_setistrue.is_singleis redefined as "simple GROUP BY" (no grouping sets), i.e.!self.has_grouping_set.Integrate the new semantics into
AggregateExec:group_by.is_true_no_grouping()instead ofgroup_by.expr.is_empty()when choosing between the specialized no-grouping aggregation path and grouped aggregation.is_unordered_unfiltered_group_by_distinctonly treats plans as grouped when there are grouping expressions and no grouping sets (!has_grouping_set).GROUP BYwhile correctly handlingGROUPING SETSand related constructs.Support
__grouping_idwith the no-grouping aggregation stream:AggregateStreamInnerwith an optionalgrouping_id: Option<ScalarValue>field.AggregateStream::newto accept agrouping_idargument.prepend_grouping_id_columnto prepend a__grouping_idcolumn to the finalized accumulator output when needed.__grouping_idin grouping-set scenarios.Planner and execution wiring updates:
Update all
PhysicalGroupBy::newcall sites to pass the correcthas_grouping_setvalue:falsefor:GROUP BYor truly no-grouping aggregates.truefor:GROUPING SETS,CUBE, andROLLUPphysical planning paths.Ensure
merge_grouping_set_physical_expr,create_cube_physical_expr, andcreate_rollup_physical_exprcorrectly mark grouping-set plans.Protobuf / physical plan round-trip support:
AggregateExecNodeindatafusion.protowith a newbool has_grouping_set = 12;field.pbjsonandprostcode to serialize and deserialize the new field.AggregateExecfrom protobuf, pass the decodedhas_grouping_setintoPhysicalGroupBy::new.AggregateExecback to protobuf, sethas_grouping_setbased onexec.group_expr().has_grouping_set().Tests and SQL logic coverage:
Add sqllogictests for the previously failing cases in
grouping.slt:SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (());SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())(the original panic case).Extend or adjust unit tests in
aggregates,physical_planner,filter_pushdown, andcoopmodules to account for thehas_grouping_setflag inPhysicalGroupByand expected debug output.Update proto round-trip tests to validate
has_grouping_setis preserved.Are these changes tested?
Yes.
New sqllogictests covering
GROUPING SETS(())for both a regular table andgenerate_series(10):grouping.sltnow asserts the expected scalar results (e.g.2and55), preventing regressions on this edge case.Updated and existing Rust unit tests:
physical-plan/src/aggregatestests updated to includehas_grouping_setinPhysicalGroupByexpectations.physical_planner.rs,filter_pushdown) updated to constructPhysicalGroupBywith the new flag.core/tests/execution/coop.rsupdated to reflect the new constructor and continue to exercise the no-grouping aggregation path.has_grouping_setis correctly serialized and deserialized.These tests collectively ensure that:
GROUPING SETS(())are correct, andAre there any user-facing changes?
Yes, but they are bug fixes and behavior clarifications rather than breaking changes:
Queries using
GROUP BY GROUPING SETS(())no longer cause a runtime panic in the DataFusion CLI.COUNT(*)orSUM(v1)), consistent with SQL semantics.For plans using
GROUPING SETS,CUBE, orROLLUP, the internal__grouping_idcolumn is now present consistently whenever grouping sets are in use, even when the grouping expressions are empty.For ordinary
GROUP BYqueries that do not use grouping sets, behavior is unchanged: no unexpected__grouping_idcolumn is added.No API signatures were changed in a breaking way for downstream users; the additions are internal flags and protobuf fields to accurately represent the physical plan.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.