fix(substrait): Do not add implicit groupBy expressions in LogicalPlanBuilder or when building logical plans from Substrait#14860
Conversation
|
Hi @anlinc It looks like there is a small clippy failure with this PR: https://github.com/apache/datafusion/actions/runs/13507870042/job/37742937287?pr=14860 |
| pub struct LogicalPlanBuilderOptions { | ||
| /// Flag indicating whether the plan builder should add | ||
| /// functionally dependent expressions as additional aggregation groupings. | ||
| add_implicit_group_by_exprs: bool, |
There was a problem hiding this comment.
Is the default now false? That's a breaking change I guess? Is that fine, or should we make the default be true to maintain earlier behavior?
There was a problem hiding this comment.
Yes, the default is now false.
My intuition was that this made the most sense as I'd prefer a composable base builder that suits all needs, that allows opt-ins to advanced features and optimizations.
There was a problem hiding this comment.
Although admittedly, I was not aware that this was going to be a breaking change, since SQL and Dataframes maintain the same behavior.
@vbarua informed me that it may be possible that there are those depending on the builder directly. In which case, I can see an argument for maintaining the default true behavior...
I'm a new contributor to this project so I'm lacking context. I might lean on you folks to make a more informed decision here.
There was a problem hiding this comment.
I lean towards having the default behaviour be false for this, even if it's a breaking change, because it makes the builder less surprising IMO. Specifically, when invoking the builder for an aggregate with a specific set of grouping expressions, my expectation is that it should produce an aggregate with those specific grouping expressions. If I wanted additional grouping expressions, I would have included them.
There's definitely room and value for optimizations like what is going on here, but I think those need to be opt-in to avoid situation like this were the plan builder tries to be smart along one specific axis and inadvertently shoots you in the foot in another. In the past, I think we've leaned towards having the builder be as straightforward as possible and then handling optimizations in the optimizer.
There was a problem hiding this comment.
I agree it would be less surprising, but I think we should consider it in a follow on PR. This PR fixes the issue with substrait and is a good change on its own. I have filed a follow on ticket to track the idea:
Update: I see this PR changes the defaults. I updated the title of the PR and will mark it as an API change
| #[tokio::test] | ||
| async fn multilayer_aggregate() -> Result<()> { | ||
| assert_expected_plan( | ||
| "SELECT a, sum(partial_count_b) FROM (SELECT a, count(b) as partial_count_b FROM data GROUP BY a) GROUP BY a", |
There was a problem hiding this comment.
I'm still a bit confused by this. Why does the original plan created from SQL already contain the implicit groupBy's here?
There was a problem hiding this comment.
This test goes SQL -> Logical Plan -> Substrait Plan -> Logical Plan.
There were no additional groupBys added between SQL -> Logical Plan -> Substrait Plan.
(Prior to this PR) Additional groupBys were added between Substrait Plan -> Logical Plan, violating the round trip plan expectations.
Blizzara
left a comment
There was a problem hiding this comment.
Thanks @anlinc , great to get this fixed and the approach here looks good! Just a note on the default in LogicalPlanBuilderOptions - should it be "true" to not break potential other users?
And a question on one of the tests, just for my understanding 😄
LogicalPlanBuilder or when building logical plans from Substrait
…anBuilder` or when building logical plans from Substrait (apache#14860) * feat: add add_implicit_group_by_exprs option to logical plan builder * fix: do not add implicity group by exprs in substrait path * test: add substrait tests * test: add builder option tests * style: clippy errors
Which issue does this PR close?
Closes #14348
Rationale for this change
Substrait plans are intended to be interpreted literally. When you see plan nodes like:
The output mapping (e.g.
[0, 3]) contains ordinals representing the offset of the target expression(s) within the [input, output] list. If the DataFusion LogicalPlanBuilder is introducing additional input expressions, this violates the plan's intent and will produce the incorrect output mappings. Please see the issue for a concrete example.What changes are included in this PR?
add_implicit_group_by_exprsoption to the logical plan builder. It is disabled by default.Are these changes tested?
Added a multilayer aggregation Substrait example. The first aggregation produces a unique column with a functional dependency. Despite this, the second aggregation must not introduce any additional grouping expressions.
There should be no changes in the non-Substrait paths.
Are there any user-facing changes?
Yes. LogicalPlanBuilder will not longer add implicity group by expressions by default.
SQL, DataFrame paths remain untouched.
Appendix
This is a continuation of #14553.