Skip to content

Conversation

@zhengshiJ
Copy link
Contributor

@zhengshiJ zhengshiJ commented Dec 23, 2022

Proposed changes

Issue Number: close #xxx

Problem summary

Original: group by is bound to the outputExpression of the current node.

Problem: When the name of the new reference of outputExpression is the same as the child's output column, the child's output column should be used for group by, but at this time, the new reference of the node's outputExpression will be used for group by, resulting in an error

Now: Give priority to the child's output for group by binding. If the child does not have a corresponding column, use the outputExpression of this node for binding

eg:

select coalesce(col1, 'all') as col1, count(*) as cnt from (select  null as col1  union all  select  'a' as col1 ) t group by grouping sets ((col1),());

before: wrong result

+------+------+
| col1 | cnt  |
+------+------+
| all  |    1 |
| a    |    1 |
| NULL |    2 |
+------+------+

now: right result

+------+------+
| col1 | cnt  |
+------+------+
| all  |    1 |
| a    |    1 |
| all  |    2 |
+------+------+

before

+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| PLAN FRAGMENT 0                                          |
|   OUTPUT EXPRS:                                          |
|     col1[#8]                                             |
|     cnt[#9]                                              |
|   PARTITION: UNPARTITIONED                               |
|                                                          |
|   VRESULT SINK                                           |
|                                                          |
|   4:VEXCHANGE                                            |
|                                                          |
| PLAN FRAGMENT 1                                          |
|                                                          |
|   PARTITION: HASH_PARTITIONED: col1[#3], GROUPING_ID[#4] |
|                                                          |
|   STREAM DATA SINK                                       |
|     EXCHANGE ID: 04                                      |
|     UNPARTITIONED                                        |
|                                                          |
|   3:VAGGREGATE (update finalize)                         |
|   |  output: count(*)[#7]                                |
|   |  group by: col1[#3], GROUPING_ID[#4]                 |
|   |  cardinality=2                                       |
|   |  projections: col1[#5], cnt[#7]                      |
|   |  project output tuple id: 5                          |
|   |                                                      |
|   2:VEXCHANGE                                            |
|                                                          |
| PLAN FRAGMENT 2                                          |
|                                                          |
|   PARTITION: UNPARTITIONED                               |
|                                                          |
|   STREAM DATA SINK                                       |
|     EXCHANGE ID: 02                                      |
|     HASH_PARTITIONED: col1[#3], GROUPING_ID[#4]          |
|                                                          |
|   1:VREPEAT_NODE                                         |
|   |  repeat: repeat 1 lines [[3], []]                    |
|   |  exprs: col1[#1]                                     |
|   |  output slots: `null`, `GROUPING_ID`                 |
|   |                                                      |
|   0:VUNION                                               |
|      constant exprs:                                     |
|          NULL                                            |
|          'a'                                             |
|      projections: coalesce(col1[#0], 'all')              |
|      project output tuple id: 1                          |
+----------------------------------------------------------+

now

+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| PLAN FRAGMENT 0                                          |
|   OUTPUT EXPRS:                                          |
|     col1[#7]                                             |
|     cnt[#8]                                              |
|   PARTITION: UNPARTITIONED                               |
|                                                          |
|   VRESULT SINK                                           |
|                                                          |
|   4:VEXCHANGE                                            |
|                                                          |
| PLAN FRAGMENT 1                                          |
|                                                          |
|   PARTITION: HASH_PARTITIONED: col1[#2], GROUPING_ID[#3] |
|                                                          |
|   STREAM DATA SINK                                       |
|     EXCHANGE ID: 04                                      |
|     UNPARTITIONED                                        |
|                                                          |
|   3:VAGGREGATE (update finalize)                         |
|   |  output: count(*)[#6]                                |
|   |  group by: col1[#2], GROUPING_ID[#3]                 |
|   |  cardinality=2                                       |
|   |  projections: coalesce(col1[#4], 'all'), cnt[#6]     |
|   |  project output tuple id: 4                          |
|   |                                                      |
|   2:VEXCHANGE                                            |
|                                                          |
| PLAN FRAGMENT 2                                          |
|                                                          |
|   PARTITION: UNPARTITIONED                               |
|                                                          |
|   STREAM DATA SINK                                       |
|     EXCHANGE ID: 02                                      |
|     HASH_PARTITIONED: col1[#2], GROUPING_ID[#3]          |
|                                                          |
|   1:VREPEAT_NODE                                         |
|   |  repeat: repeat 1 lines [[2], []]                    |
|   |  exprs: col1[#0]                                     |
|   |  output slots: `null`, `GROUPING_ID`                 |
|   |                                                      |
|   0:VUNION                                               |
|      constant exprs:                                     |
|          NULL                                            |
|          'a'                                             |
+----------------------------------------------------------+

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Has unit tests been added:
    • Yes
    • No
    • No Need
  3. Has document been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@hello-stephen
Copy link
Contributor

hello-stephen commented Dec 23, 2022

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 35 seconds
load time: 655 seconds
storage size: 17122858713 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221227111322_clickbench_pr_69368.html

Copy link
Contributor

@morrySnow morrySnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we need more comment to explain the purpose of the change in NormalizeToSlot. Currently, it is hard to understand why add these code to let repeat work correctlly.

@morrySnow
Copy link
Contributor

morrySnow commented Dec 26, 2022

the root cause of this problem is that when we do bind Aggregate, we should bind coalesce(col1, 'all') on the Aggregate's output, not it's input's output.

the simple case is

CREATE TABLE `t3` (
  `c1` int(11) NULL,
  `c2` text NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`c1`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);

insert into t3 values(1, "a1"), (2, "a2"), (3, "a3");

select substring(c2, 1, 1) as c2, count(1) from t3 group by c2;

the legacy planner's result is

+------+----------+
| c2   | count(1) |
+------+----------+
| a    |        1 |
| a    |        1 |
| a    |        1 |
+------+----------+

the Nereids' result is

+------+----------+
| c2   | count(1) |
+------+----------+
| a    |        3 |
+------+----------+

@zhengshiJ zhengshiJ changed the title [Fix](Nereids)fix scalarFunction and groupingSets [Fix](Nereids)fix group by binding error, resulting in incorrect results Dec 27, 2022
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Dec 27, 2022
@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. area/nereids kind/test reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants