[Fix](Nereids)fix group by binding error, resulting in incorrect results #15328

zhengshiJ · 2022-12-23T11:47:13Z

Proposed changes

Issue Number: close #xxx

Problem summary

Original: group by expressions are bound against the outputExpression list of the current node.

Problem: When an alias introduced in outputExpression has the same name as one of the child's output columns, group by should use the child's output column. Instead, group by was bound to the node's own aliased expression, which produced incorrect results.

Now: group by binding gives priority to the child's output. Only if the child has no matching column is the outputExpression of this node used for binding.
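The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the priority rule, not the Nereids binder itself: the function `bind_group_by_name`, its return tuples, and the `tag` alias are hypothetical names made up for this sketch.

```python
def bind_group_by_name(name, child_output, output_aliases):
    """Resolve a GROUP BY name (hypothetical helper, not Nereids code).

    Priority: the child's output column first, then an alias defined
    in this node's output expressions.
    """
    if name in child_output:
        return ("child_column", name)
    if name in output_aliases:
        return ("alias", output_aliases[name])
    raise LookupError(f"cannot bind group by key: {name}")


# Columns produced by the child plan, and aliases defined in the select list.
child_output = {"col1"}
output_aliases = {
    "col1": "coalesce(col1, 'all')",  # alias that shadows the child's col1
    "tag": "upper(col1)",             # hypothetical alias with no matching child column
}

print(bind_group_by_name("col1", child_output, output_aliases))
print(bind_group_by_name("tag", child_output, output_aliases))
```

With this order, `col1` resolves to the child's column even though an alias of the same name exists, while `tag` falls back to the alias expression.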

e.g.:

select coalesce(col1, 'all') as col1, count(*) as cnt from (select null as col1 union all select 'a' as col1) t group by grouping sets ((col1), ());

before: wrong result

+------+------+
| col1 | cnt  |
+------+------+
| all  |    1 |
| a    |    1 |
| NULL |    2 |
+------+------+

now: right result

+------+------+
| col1 | cnt  |
+------+------+
| all  |    1 |
| a    |    1 |
| all  |    2 |
+------+------+
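The difference between the two result tables can be reproduced with a small Python model of grouping sets. It is a simplified sketch under stated assumptions: `grouping_sets_count` and `coalesce` are hypothetical helpers, each grouping set is counted separately (standing in for GROUPING_ID), and a key that is not part of a grouping set is replaced by None.

```python
from collections import Counter


def coalesce(value, default):
    """Return default when value is None, mirroring SQL coalesce for one argument."""
    return default if value is None else value


def grouping_sets_count(values, use_key_flags):
    """Count rows per grouping set (hypothetical helper).

    use_key_flags has one flag per grouping set: True keeps the key,
    False replaces it with None, as for the empty grouping set ().
    """
    result = []
    for use_key in use_key_flags:
        counts = Counter(v if use_key else None for v in values)
        result.extend(counts.items())
    return result


child_col1 = [None, "a"]      # rows produced by subquery t
grouping_sets = [True, False]  # ((col1), ())

# Before the fix: group by is bound to the alias, so coalesce runs below the repeat
# and the None introduced for the empty grouping set is never replaced.
before = grouping_sets_count([coalesce(v, "all") for v in child_col1], grouping_sets)

# After the fix: group by is bound to the child's col1, and coalesce runs
# in the projection on top of the aggregate.
after = [(coalesce(k, "all"), n) for k, n in grouping_sets_count(child_col1, grouping_sets)]

print(before)  # [('all', 1), ('a', 1), (None, 2)]
print(after)   # [('all', 1), ('a', 1), ('all', 2)]
```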

before

+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| PLAN FRAGMENT 0                                          |
|   OUTPUT EXPRS:                                          |
|     col1[#8]                                             |
|     cnt[#9]                                              |
|   PARTITION: UNPARTITIONED                               |
|                                                          |
|   VRESULT SINK                                           |
|                                                          |
|   4:VEXCHANGE                                            |
|                                                          |
| PLAN FRAGMENT 1                                          |
|                                                          |
|   PARTITION: HASH_PARTITIONED: col1[#3], GROUPING_ID[#4] |
|                                                          |
|   STREAM DATA SINK                                       |
|     EXCHANGE ID: 04                                      |
|     UNPARTITIONED                                        |
|                                                          |
|   3:VAGGREGATE (update finalize)                         |
|   |  output: count(*)[#7]                                |
|   |  group by: col1[#3], GROUPING_ID[#4]                 |
|   |  cardinality=2                                       |
|   |  projections: col1[#5], cnt[#7]                      |
|   |  project output tuple id: 5                          |
|   |                                                      |
|   2:VEXCHANGE                                            |
|                                                          |
| PLAN FRAGMENT 2                                          |
|                                                          |
|   PARTITION: UNPARTITIONED                               |
|                                                          |
|   STREAM DATA SINK                                       |
|     EXCHANGE ID: 02                                      |
|     HASH_PARTITIONED: col1[#3], GROUPING_ID[#4]          |
|                                                          |
|   1:VREPEAT_NODE                                         |
|   |  repeat: repeat 1 lines [[3], []]                    |
|   |  exprs: col1[#1]                                     |
|   |  output slots: `null`, `GROUPING_ID`                 |
|   |                                                      |
|   0:VUNION                                               |
|      constant exprs:                                     |
|          NULL                                            |
|          'a'                                             |
|      projections: coalesce(col1[#0], 'all')              |
|      project output tuple id: 1                          |
+----------------------------------------------------------+

now

+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| PLAN FRAGMENT 0                                          |
|   OUTPUT EXPRS:                                          |
|     col1[#7]                                             |
|     cnt[#8]                                              |
|   PARTITION: UNPARTITIONED                               |
|                                                          |
|   VRESULT SINK                                           |
|                                                          |
|   4:VEXCHANGE                                            |
|                                                          |
| PLAN FRAGMENT 1                                          |
|                                                          |
|   PARTITION: HASH_PARTITIONED: col1[#2], GROUPING_ID[#3] |
|                                                          |
|   STREAM DATA SINK                                       |
|     EXCHANGE ID: 04                                      |
|     UNPARTITIONED                                        |
|                                                          |
|   3:VAGGREGATE (update finalize)                         |
|   |  output: count(*)[#6]                                |
|   |  group by: col1[#2], GROUPING_ID[#3]                 |
|   |  cardinality=2                                       |
|   |  projections: coalesce(col1[#4], 'all'), cnt[#6]     |
|   |  project output tuple id: 4                          |
|   |                                                      |
|   2:VEXCHANGE                                            |
|                                                          |
| PLAN FRAGMENT 2                                          |
|                                                          |
|   PARTITION: UNPARTITIONED                               |
|                                                          |
|   STREAM DATA SINK                                       |
|     EXCHANGE ID: 02                                      |
|     HASH_PARTITIONED: col1[#2], GROUPING_ID[#3]          |
|                                                          |
|   1:VREPEAT_NODE                                         |
|   |  repeat: repeat 1 lines [[2], []]                    |
|   |  exprs: col1[#0]                                     |
|   |  output slots: `null`, `GROUPING_ID`                 |
|   |                                                      |
|   0:VUNION                                               |
|      constant exprs:                                     |
|          NULL                                            |
|          'a'                                             |
+----------------------------------------------------------+

Checklist(Required)

Does it affect the original behavior:
- Yes
- No
- I don't know
Has unit tests been added:
- Yes
- No
- No Need
Has document been added or modified:
- Yes
- No
- No Need
Does it need to update dependencies:
- Yes
- No
Are there any changes that cannot be rolled back:
- Yes (If Yes, please explain WHY)
- No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

hello-stephen · 2022-12-23T15:25:11Z

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 35 seconds
load time: 655 seconds
storage size: 17122858713 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221227111322_clickbench_pr_69368.html

morrySnow

I think we need more comments to explain the purpose of the change in NormalizeToSlot. Currently, it is hard to understand why this code was added to make repeat work correctly.

morrySnow · 2022-12-26T16:18:38Z

The root cause of this problem is that when we bind Aggregate, we should bind coalesce(col1, 'all') on the Aggregate's output, not its input's output.

A simpler case is:

CREATE TABLE `t3` (
  `c1` int(11) NULL,
  `c2` text NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`c1`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);

insert into t3 values(1, "a1"), (2, "a2"), (3, "a3");

select substring(c2, 1, 1) as c2, count(1) from t3 group by c2;

The legacy planner's result is:

+------+----------+
| c2   | count(1) |
+------+----------+
| a    |        1 |
| a    |        1 |
| a    |        1 |
+------+----------+

Nereids' result is:

+------+----------+
| c2   | count(1) |
+------+----------+
| a    |        3 |
+------+----------+

github-actions · 2022-12-27T10:06:06Z

PR approved by at least one committer and no changes requested.

github-actions · 2022-12-27T10:06:09Z

PR approved by anyone and no changes requested.

github-actions bot added area/nereids kind/test labels Dec 23, 2022

morrySnow reviewed Dec 26, 2022

zhengshiJ changed the title ~~[Fix](Nereids)fix scalarFunction and groupingSets~~ [Fix](Nereids)fix group by binding error, resulting in incorrect results Dec 27, 2022

jianghaochen added 2 commits December 27, 2022 17:14

[Fix](Nereids)fix scalarFunction and groupingSets

e6afa52

fix

461f4d4

zhengshiJ force-pushed the fixGroupingSets branch from 2a03586 to 461f4d4 Compare December 27, 2022 09:14

924060929 approved these changes Dec 27, 2022

github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Dec 27, 2022

924060929 merged commit 2af831d into apache:master Dec 28, 2022

zhengshiJ mentioned this pull request Dec 29, 2022

[Fix](Nereids) Group by binding should be consistent with the behavior of the old optimizer #15484

Closed

13 tasks

4 participants
