[fix](stats) Fix data size calculation of auto sample #24672

Kikyou1997 · 2023-09-20T06:53:13Z

Fix the data size calculation for auto sample. Before this PR, the data size included all replicas.
Move some auto-analyze-related options to global session variables.
Add some logs.
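
To illustrate why counting every replica inflates the size used to decide whether a table is "huge", here is a minimal sketch. The `Tablet` class, `ReplicaSizeSketch`, and the choice of taking the largest replica as the single-replica size are assumptions made for illustration; they are not the real Doris catalog classes or the exact logic behind `partition.getDataSize(singleReplica)`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model; NOT the real Doris catalog classes.
final class ReplicaSizeSketch {

    /** A tablet with the on-disk size of each of its replicas, in bytes. */
    static final class Tablet {
        final List<Long> replicaSizes;

        Tablet(List<Long> replicaSizes) {
            this.replicaSizes = replicaSizes;
        }

        long dataSize(boolean singleReplica) {
            if (singleReplica) {
                // Assumption: count the logical data once by taking the largest replica.
                return replicaSizes.stream().mapToLong(Long::longValue).max().orElse(0L);
            }
            // Summing every replica multiplies the size by the replication factor.
            return replicaSizes.stream().mapToLong(Long::longValue).sum();
        }
    }

    /** Accumulates tablet sizes, mirroring the per-partition loop in the PR diff. */
    static long tableDataSize(List<Tablet> tablets, boolean singleReplica) {
        long total = 0;
        for (Tablet tablet : tablets) {
            total += tablet.dataSize(singleReplica);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Tablet> tablets = Arrays.asList(
                new Tablet(Arrays.asList(100L, 100L, 100L)),
                new Tablet(Arrays.asList(50L, 50L, 50L)));
        System.out.println(tableDataSize(tablets, false)); // all replicas: 450
        System.out.println(tableDataSize(tablets, true));  // single replica: 150
    }
}
```

With three replicas per tablet, the all-replica total is three times the logical size, so a size threshold would be reached three times too early.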

Proposed changes

Issue Number: close #xxx

Kikyou1997 · 2023-09-20T07:18:04Z

run buildall

doris-robot · 2023-09-20T08:00:13Z

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.05 seconds
stream load tsv: 596 seconds loaded 74807831229 Bytes, about 119 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162380106 Bytes

Kikyou1997 · 2023-09-20T11:17:33Z

run buildall

Kikyou1997 · 2023-09-20T11:32:25Z

run buildall

doris-robot · 2023-09-20T11:57:24Z

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.79 seconds
stream load tsv: 618 seconds loaded 74807831229 Bytes, about 115 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162479039 Bytes

Jibing-Li

LGTM

github-actions · 2023-09-21T02:00:44Z

PR approved by anyone and no changes requested.

englefly · 2023-09-21T01:58:48Z

docs/zh-CN/docs/query-acceleration/statistics.md

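 |enable_auto_sample|Whether to enable automatic sampling for large tables. When enabled, tables larger than huge_table_lower_bound_size_in_bytes have their statistics collected by sampling.|false|
 |auto_analyze_job_record_count|Number of execution records persisted for automatically triggered statistics jobs|20000|
-|huge_table_default_sample_rows|Number of rows sampled from a large table when automatic sampling for large tables is enabled|200000|
+|huge_table_default_sample_percent|Percentage of a large table that is sampled when automatic sampling for large tables is enabled|10|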


Wouldn't it be better to define a large table by row count rather than by data size?

What is the reasoning?

englefly · 2023-09-21T02:02:26Z

fe/fe-core/src/main/java/org/apache/doris/catalog/OlapTable.java

        long dataSize = 0;
        for (Partition partition : getAllPartitions()) {
-            dataSize += partition.getDataSize(false);
+            dataSize += partition.getDataSize(singleReplica);


For numberLike/dateLike columns, it is not necessary to compute dataSize.
For example, for an int column:
dataSize = 4 * rowCount

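Uh, you're right, but I don't see how this relates to my change.

Check the data type here; if the column is fixed-length, there is no need to accumulate in the for loop here.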
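
The reviewer's suggestion can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `FixedWidthSizeSketch`, `ColumnType`, and `estimateColumnDataSize` are hypothetical names, and the widths other than the 4-byte int mentioned above are illustrative assumptions.

```java
import java.util.OptionalLong;

// Hypothetical sketch of the fixed-width shortcut; NOT part of the Doris code base.
final class FixedWidthSizeSketch {

    /** Illustrative column types; a width of -1 marks a variable-length type. */
    enum ColumnType {
        INT(4), BIGINT(8), VARCHAR(-1);

        final int fixedWidthBytes;

        ColumnType(int fixedWidthBytes) {
            this.fixedWidthBytes = fixedWidthBytes;
        }

        boolean isFixedWidth() {
            return fixedWidthBytes > 0;
        }
    }

    /**
     * Returns width * rowCount for fixed-width columns. Returns empty for
     * variable-length columns, where the caller still has to accumulate
     * sizes partition by partition.
     */
    static OptionalLong estimateColumnDataSize(ColumnType type, long rowCount) {
        if (!type.isFixedWidth()) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Math.multiplyExact((long) type.fixedWidthBytes, rowCount));
    }

    public static void main(String[] args) {
        System.out.println(estimateColumnDataSize(ColumnType.INT, 1_000L));     // OptionalLong[4000]
        System.out.println(estimateColumnDataSize(ColumnType.VARCHAR, 1_000L)); // OptionalLong.empty
    }
}
```

Note that this estimates a per-column size from the row count, whereas the loop in the diff sums per-partition sizes for the whole table, which is the distinction the PR author raises in the reply above.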

Kikyou1997 · 2023-09-21T03:07:44Z

run buildall

doris-robot · 2023-09-21T03:57:16Z

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.77 seconds
stream load tsv: 597 seconds loaded 74807831229 Bytes, about 119 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162433148 Bytes

englefly · 2023-09-22T06:16:28Z

fe/fe-core/src/main/java/org/apache/doris/catalog/OlapTable.java

        long dataSize = 0;
        for (Partition partition : getAllPartitions()) {
-            dataSize += partition.getDataSize(false);
+            dataSize += partition.getDataSize(singleReplica);


Check the data type here; if the column is fixed-length, there is no need to accumulate in the for loop here.

github-actions · 2023-09-22T06:17:28Z

PR approved by at least one committer and no changes requested.

…anch 2.0 (#25119) This PR is composed of the following commits, which have been merged to Doris master: * #24769 * #24672 * #24599 * #24521 * #24405 * #24237 * #24135 * #24074 * #24026 * #23992 * #23978 * #23622 * #23507 * #23354 * #23103 * #22963 * #22896 * #22775 * #22773

….0 (#25421) This PR is composed of the following commits, which have been merged to Doris master: * #24769 * #24672 * #24599 * #24521 * #24405 * #24237 * #24135 * #24074 * #24026 * #23992 * #23978 * #23622 * #23507 * #23354 * #23103 * #22963 * #22896 * #22775 * #22773 After this PR, when a user upgrades Doris from 2.0.2 to 2.0.3, the original info in AnalysisManager will be ignored, and the new module AnalysisManagerV2 will be saved (with more info).

Kikyou1997 force-pushed the fix/data_size branch from ab411a9 to 1c0c065 Compare September 20, 2023 07:17

Kikyou1997 force-pushed the fix/data_size branch from 1c0c065 to ce9c8ed Compare September 20, 2023 11:17

Jibing-Li approved these changes Sep 21, 2023

github-actions bot added the reviewed label Sep 21, 2023

englefly reviewed Sep 21, 2023

1. disable page cache

18bfa0b

2. single replica for data size calc 3. move auto analyze related options to global session variable

Kikyou1997 force-pushed the fix/data_size branch from ce9c8ed to 18bfa0b Compare September 21, 2023 03:05

englefly approved these changes Sep 22, 2023

github-actions bot added the approved label (indicates a PR has been approved by one committer) Sep 22, 2023

morrySnow changed the title ~~[fix](optimizer) Fix data size calculation of auto sample~~ [fix](stats) Fix data size calculation of auto sample Sep 22, 2023

morrySnow merged commit c943a05 into apache:master Sep 22, 2023

Kikyou1997 mentioned this pull request Oct 12, 2023

[refactor](stats) Migrate stats framework from master to branch 2.0 #25119

Merged

xiaokang added the dev/2.0.3-merged label Oct 13, 2023

morningman mentioned this pull request Oct 15, 2023

Migrate stats framework from master to branch-2.0 #25421

Merged
