@Kikyou1997
Contributor

@Kikyou1997 Kikyou1997 commented Sep 20, 2023

  1. Fix the data size calculation for auto sample. Before this PR, the data size included all replicas.
  2. Move some auto-analyze-related options to global session variables.
  3. Add some logs.
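
To illustrate item 1, here is a minimal sketch of the difference between summing every replica and counting one replica per tablet. The classes `ReplicaSketch`, `TabletSketch`, and `DataSizeSketch` are hypothetical stand-ins, not the actual Doris classes, and picking the largest replica as the "single replica" is an assumption made only for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a replica; not the Doris class.
class ReplicaSketch {
    final long dataSize;

    ReplicaSketch(long dataSize) {
        this.dataSize = dataSize;
    }
}

// Hypothetical stand-in for a tablet holding several replicas of the same data.
class TabletSketch {
    final List<ReplicaSketch> replicas;

    TabletSketch(List<ReplicaSketch> replicas) {
        this.replicas = replicas;
    }

    // singleReplica = true: count only one replica (here, the largest; an assumption).
    // singleReplica = false: sum all replicas, which overstates the logical size.
    long getDataSize(boolean singleReplica) {
        long sum = 0;
        long max = 0;
        for (ReplicaSketch r : replicas) {
            sum += r.dataSize;
            max = Math.max(max, r.dataSize);
        }
        return singleReplica ? max : sum;
    }
}

class DataSizeSketch {
    static long tableDataSize(List<TabletSketch> tablets, boolean singleReplica) {
        long total = 0;
        for (TabletSketch t : tablets) {
            total += t.getDataSize(singleReplica);
        }
        return total;
    }

    public static void main(String[] args) {
        TabletSketch t = new TabletSketch(Arrays.asList(
                new ReplicaSketch(100L), new ReplicaSketch(100L), new ReplicaSketch(100L)));
        System.out.println(tableDataSize(Arrays.asList(t), false)); // 300: all replicas
        System.out.println(tableDataSize(Arrays.asList(t), true));  // 100: one replica
    }
}
```

With three replicas, the all-replica figure is three times the logical size, so a size threshold for "huge" tables would trigger far earlier than intended.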

Proposed changes

Issue Number: close #xxx

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@Kikyou1997
Contributor Author

run buildall

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.05 seconds
stream load tsv: 596 seconds loaded 74807831229 Bytes, about 119 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162380106 Bytes
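
How the bot derives its throughput figures is not shown in this thread. The reported numbers are, however, consistent with bytes divided by seconds, expressed in units of 2^20 bytes and truncated. A small sketch under that assumption (the class and method names are hypothetical):

```java
class ThroughputSketch {
    // Bytes per second in units of 2^20 bytes, truncated by integer division.
    static long mibPerSecond(long bytes, long seconds) {
        return bytes / seconds / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println(mibPerSecond(74807831229L, 596L)); // stream load tsv: 119
        System.out.println(mibPerSecond(2358488459L, 20L));   // stream load json: 112
        System.out.println(mibPerSecond(1101869774L, 64L));   // stream load orc: 16
        System.out.println(mibPerSecond(861443392L, 32L));    // stream load parquet: 25
    }
}
```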

@Kikyou1997
Contributor Author

run buildall

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.79 seconds
stream load tsv: 618 seconds loaded 74807831229 Bytes, about 115 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162479039 Bytes

Contributor

@Jibing-Li Jibing-Li left a comment

LGTM

@github-actions
Contributor

PR approved by anyone and no changes requested.

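|enable_auto_sample|Whether to enable automatic sampling for large tables. When enabled, statistics for tables larger than huge_table_lower_bound_size_in_bytes are collected by sampling| false|
|auto_analyze_job_record_count|Controls the number of persisted execution records for automatically triggered statistics jobs|20000|
|huge_table_default_sample_rows|Defines the number of rows to sample from a large table when automatic sampling for large tables is enabled|200000|
|huge_table_default_sample_percent|Defines the sampling percentage for a large table when automatic sampling for large tables is enabled|10|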
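
As a rough sketch of how the options in these rows interact: sampling is chosen only when `enable_auto_sample` is on and the table's data size exceeds `huge_table_lower_bound_size_in_bytes`. The class, enum, and method below are hypothetical, and the threshold used in `main` is a made-up value, since this thread does not state the default.

```java
class AutoSampleSketch {
    enum Method { FULL, SAMPLE }

    // Mirrors the documented rule: sample only when the switch is on
    // and the table is strictly larger than the lower bound.
    static Method chooseMethod(boolean enableAutoSample,
                               long tableDataSizeBytes,
                               long hugeTableLowerBoundSizeInBytes) {
        if (enableAutoSample && tableDataSizeBytes > hugeTableLowerBoundSizeInBytes) {
            return Method.SAMPLE;
        }
        return Method.FULL;
    }

    public static void main(String[] args) {
        long lowerBound = 5L * 1024 * 1024 * 1024; // made-up threshold for illustration
        long tableSize = 17162380106L;             // example size in bytes
        System.out.println(chooseMethod(true, tableSize, lowerBound));  // SAMPLE
        System.out.println(chooseMethod(false, tableSize, lowerBound)); // FULL
    }
}
```

Because the decision depends on the table's data size, a size that counts every replica would push tables over the threshold earlier than a single-replica size would.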
Contributor

Wouldn't it be better to define a large table by row count rather than by data size?

Contributor Author

What is the reasoning?

  long dataSize = 0;
  for (Partition partition : getAllPartitions()) {
-     dataSize += partition.getDataSize(false);
+     dataSize += partition.getDataSize(singleReplica);
Contributor

For numberLike/dateLike columns, it is not necessary to compute dataSize.
For example, for an int column:
dataSize = 4 * rowCount

Contributor Author

Uh, you're right, but I don't see how that relates to this change.

Contributor

Check the data type here. If it is a fixed-length field, there is no need to accumulate in the for loop here.
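
The reviewer's suggestion could look roughly like the sketch below: for a fixed-length type the size is width times row count, and only variable-length types need the accumulation loop. `FixedWidthSizeSketch`, `ColType`, and `columnDataSize` are hypothetical names for illustration and do not reflect the code in this PR.

```java
class FixedWidthSizeSketch {
    // Hypothetical column types; width is bytes per value, -1 marks variable length.
    enum ColType {
        INT(4), BIGINT(8), VARCHAR(-1);

        final int width;

        ColType(int width) {
            this.width = width;
        }
    }

    static long columnDataSize(ColType type, long rowCount, long[] perPartitionSizes) {
        if (type.width > 0) {
            // Fixed-length shortcut: e.g. an int column is 4 * rowCount.
            return (long) type.width * rowCount;
        }
        // Variable-length fallback: accumulate the measured sizes.
        long total = 0;
        for (long size : perPartitionSizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] sizes = {10L, 20L, 30L};
        System.out.println(columnDataSize(ColType.INT, 1000L, sizes));     // 4000
        System.out.println(columnDataSize(ColType.VARCHAR, 1000L, sizes)); // 60
    }
}
```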

2. single replica for data size calc
3. move auto analyze related options to global session variable
@Kikyou1997
Contributor Author

run buildall

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.77 seconds
stream load tsv: 597 seconds loaded 74807831229 Bytes, about 119 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162433148 Bytes

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Sep 22, 2023
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@morrySnow morrySnow changed the title from "[fix](optimizer) Fix data size calculation of auto sample" to "[fix](stats) Fix data size calculation of auto sample" Sep 22, 2023
@morrySnow morrySnow merged commit c943a05 into apache:master Sep 22, 2023
morningman pushed a commit that referenced this pull request Oct 13, 2023
…anch 2.0 (#25119)

This PR is composed of the following commits, which have been merged to Doris master:

* #24769
* #24672
* #24599
* #24521
* #24405
* #24237
* #24135
* #24074
* #24026
* #23992
* #23978
* #23622
* #23507
* #23354
* #23103
* #22963
* #22896
* #22775
* #22773
morningman pushed a commit that referenced this pull request Oct 15, 2023
….0 (#25421)

This PR is composed of the following commits, which have been merged to Doris master:

* #24769
* #24672
* #24599
* #24521
* #24405
* #24237
* #24135
* #24074
* #24026
* #23992
* #23978
* #23622
* #23507
* #23354
* #23103
* #22963
* #22896
* #22775
* #22773

After this PR, when a user upgrades Doris from 2.0.2 to 2.0.3, the original info in AnalysisManager will be ignored, and the new module AnalysisManagerV2 will be saved (with more info).
Labels

approved (Indicates a PR has been approved by one committer.), dev/2.0.3-merged, reviewed

6 participants