-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix](agg) Fix grouping function handling and repeatSlotIdList calculation in DecomposeRepeatWithPreAggregation #60091
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 31857 ms |
TPC-DS: Total hot run time: 174145 ms |
ClickBench: Total hot run time: 26.84 s |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
…sions that only exist in the largest grouping set
…ts slots sets return 1 when compute value
ae3ef77 to
43a9454
Compare
|
run buildall |
TPC-H: Total hot run time: 30728 ms |
TPC-DS: Total hot run time: 173968 ms |
ClickBench: Total hot run time: 26.83 s |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 30939 ms |
TPC-DS: Total hot run time: 173439 ms |
ClickBench: Total hot run time: 27.16 s |
|
run cloud_p0 |
FE Regression Coverage ReportIncrement line coverage |
1 similar comment
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
TPC-H: Total hot run time: 32310 ms |
ClickBench: Total hot run time: 28.57 s |
|
run p0 |
|
run cloud_p0 |
|
run feut |
|
run p0 |
|
run cloud_p0 |
FE Regression Coverage ReportIncrement line coverage |
1 similar comment
FE Regression Coverage ReportIncrement line coverage |
| * This is necessary for optimization scenarios where some expressions may only exist | ||
| * in the maximum grouping set that was removed during optimization. | ||
| */ | ||
| default GroupingSetShapes toShapes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a ut for this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| */ | ||
| default List<Set<Integer>> computeRepeatSlotIdList(List<Integer> slotIdList) { | ||
| List<Set<Integer>> groupingSetsIndexesInOutput = getGroupingSetsIndexesInOutput(); | ||
| default List<Set<Integer>> computeRepeatSlotIdList(List<Integer> slotIdList, List<Slot> outputSlots) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a ut for this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| */ | ||
| default List<Set<Integer>> getGroupingSetsIndexesInOutput() { | ||
| Map<Expression, Integer> indexMap = indexesOfOutput(); | ||
| default List<Set<Integer>> getGroupingSetsIndexesInOutput(List<Slot> outputSlots) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a ut for this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| * ) | ||
| */ | ||
| default Map<Expression, Integer> indexesOfOutput() { | ||
| default Map<Expression, Integer> indexesOfOutput(List<Slot> outputSlots) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a ut for this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| } | ||
|
|
||
| public Set<GroupingScalarFunction> getGroupingScalarFunctions() { | ||
| return new HashSet<>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return immutable set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| */ | ||
| default GroupingSetShapes toShapes() { | ||
| Set<Expression> flattenGroupingSet = ImmutableSet.copyOf(ExpressionUtils.flatExpressions(getGroupingSets())); | ||
| Set<Expression> flattenGroupingSet = ImmutableSet.copyOf(getGroupByExpressions()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if flattenGroupingSet only used in L138 to construct allExpresssions, why copy twice in this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
| * ) | ||
| */ | ||
| default Map<Expression, Integer> indexesOfOutput() { | ||
| default Map<Expression, Integer> indexesOfOutput(List<Slot> outputSlots) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should add comment to explain why need pass outputSlots, not use outputExpression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
run buildall |
TPC-H: Total hot run time: 32104 ms |
ClickBench: Total hot run time: 28.33 s |
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #59116 #60045
Problem Summary:
There are 2 problems:
DecomposeRepeatWithPreAggregationoptimization removes the maximum grouping set, grouping functions (e.g.,grouping(),grouping_id()) that reference expressions only existing in the removed grouping set would fail because of index lookup failure: When computing grouping function values,GroupingSetShapes.indexOf()would return -1 for expressions not inflattenGroupingSetExpression, causing incorrect behavior.Example scenario:
After optimization removes the maximum grouping set
(a, b, c), the expressioncno longer exists in the remaining grouping sets. When computinggrouping(c), the index lookup would fail.2. The
repeatSlotIdListcalculation was incorrect because it relied on the assumption that the order ofphysicalRepeat.getOutputExpressions()matches the order ofoutputTuple. However, this assumption can be broken during rule rewriting (e.g., inDecomposeRepeatWithPreAggregation.constructRepeat()at line 502-503, wherereplacedRepeatOutputsis constructed bynew ArrayList<>(child.getOutput())and then appends grouping functions, potentially changing the order). This mismatch causes incorrect slot ID mapping and wrong query results.Example scenario:
When
DecomposeRepeatWithPreAggregationrewrites a repeat plan, the output expressions order may differ from the output tuple order, leading to incorrect slot ID mapping inrepeatSlotIdList.Solution:
Repeat.toShapes()method to ensure all expressions referenced by grouping functions are included inflattenGroupingSetExpression, even if they are not in any grouping set. For expressions that only exist in the removed maximum grouping set, they are correctly marked asshouldBeErasedToNull = truein all remaining grouping sets, which is the correct SQL semantics.Key changes:
Repeat.toShapes(): Enhanced to collect all expressions referenced by grouping functions usingExpressionUtils.collectToList()and merge them intoflattenGroupingSetExpressionshouldBeErasedToNull = truein allGroupingSetShapeinstancesGroupingSetShapes.indexOf()always returns a valid index for grouping function arguments, preventing index lookup failuresRepeat.computeRepeatSlotIdList()method to accept an additionaloutputSlotsparameter. Instead of relying on the order ofoutputExpressionsmatchingoutputTuple, the method now usesoutputSlotsto correctly map grouping set expressions to their indices in the output tuple. This ensures correct slot ID calculation even when rule rewriting changes the order of output expressions.Key changes:
computeRepeatSlotIdList(List<Integer> slotIdList, List<Slot> outputSlots): AddedoutputSlotsparametergetGroupingSetsIndexesInOutput(List<Slot> outputSlots): Modified to useoutputSlotsinstead ofgetOutputExpressions()indexesOfOutput(List<Slot> outputSlots): Modified to build index map fromoutputSlotsinstead ofgetOutputExpressions()PhysicalPlanTranslator.visitPhysicalRepeat(): Updated to passoutputSlotstocomputeRepeatSlotIdList()Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)