[fix](stats) Fix data size calculation of auto sample #24672
ab411a9 to 1c0c065
1c0c065 to ce9c8ed
Jibing-Li left a comment
LGTM
PR approved by anyone and no changes requested.
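|enable_auto_sample|Whether to enable automatic sampling for large tables. When enabled, statistics for tables larger than huge_table_lower_bound_size_in_bytes are collected by sampling automatically.|false|
|auto_analyze_job_record_count|Controls how many execution records of automatically triggered statistics jobs are persisted.|20000|
|huge_table_default_sample_rows|Number of rows sampled from a large table when automatic sampling for large tables is enabled.|200000|
|huge_table_default_sample_percent|Sampling percentage for a large table when automatic sampling for large tables is enabled.|10|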
Wouldn't it be better to define a large table by row count rather than by data size?
What is the reasoning?
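For readers following this thread, here is a minimal sketch of the size-based decision the option table describes: a table is sampled only when auto sampling is enabled and its data size exceeds `huge_table_lower_bound_size_in_bytes`. The class `AutoSamplePolicy` and its method are hypothetical names for illustration, not the Doris implementation, and the threshold passed in is whatever the caller configures.

```java
// Hypothetical sketch; field names mirror the options in the table, but this is not Doris code.
final class AutoSamplePolicy {
    /** Returned when the table should be analyzed in full rather than sampled. */
    static final long FULL_ANALYZE = -1L;

    private final boolean enableAutoSample;
    private final long hugeTableLowerBoundSizeInBytes;
    private final long hugeTableDefaultSampleRows;

    AutoSamplePolicy(boolean enableAutoSample,
                     long hugeTableLowerBoundSizeInBytes,
                     long hugeTableDefaultSampleRows) {
        this.enableAutoSample = enableAutoSample;
        this.hugeTableLowerBoundSizeInBytes = hugeTableLowerBoundSizeInBytes;
        this.hugeTableDefaultSampleRows = hugeTableDefaultSampleRows;
    }

    /** Returns the number of rows to sample, or FULL_ANALYZE if sampling does not apply. */
    long sampleRowsFor(long tableDataSizeInBytes) {
        if (!enableAutoSample) {
            return FULL_ANALYZE;
        }
        // Only tables strictly larger than the lower bound count as "huge".
        if (tableDataSizeInBytes <= hugeTableLowerBoundSizeInBytes) {
            return FULL_ANALYZE;
        }
        return hugeTableDefaultSampleRows;
    }
}
```

Switching the criterion to row count, as the reviewer suggests, would only change the comparison inside `sampleRowsFor`; the rest of the policy would stay the same.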
  long dataSize = 0;
  for (Partition partition : getAllPartitions()) {
-     dataSize += partition.getDataSize(false);
+     dataSize += partition.getDataSize(singleReplica);
For number-like and date-like columns, it is not necessary to compute dataSize this way.
For example, for an int column:
dataSize = 4 * rowCount
Uh, you're right, but I don't see how that relates to this change.
Check the data type here. If the column is fixed-length, there is no need to accumulate the size in this for loop.
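The reviewer's suggestion can be sketched as follows: for a fixed-length column the data size follows directly from the row count, so no per-partition accumulation is needed. The `ColumnType` enum and its byte widths are illustrative assumptions, not Doris types; only the int case (4 bytes per row) comes from the comment above.

```java
import java.util.OptionalLong;

// Hypothetical sketch of the fixed-width shortcut; not the Doris type system.
final class FixedWidthDataSize {
    enum ColumnType {
        INT(4),       // 4 bytes per value, as in the reviewer's example
        BIGINT(8),    // assumed 8 bytes per value
        VARCHAR(-1);  // variable length: no fixed width

        final int widthBytes;

        ColumnType(int widthBytes) {
            this.widthBytes = widthBytes;
        }
    }

    /**
     * Returns widthBytes * rowCount for fixed-width types, or empty when the
     * type is variable-length and the size has to be measured another way.
     */
    static OptionalLong estimate(ColumnType type, long rowCount) {
        if (type.widthBytes < 0) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Math.multiplyExact((long) type.widthBytes, rowCount));
    }
}
```

A caller would fall back to the partition loop only when `estimate` returns an empty value.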
2. single replica for data size calc 3. move auto analyze related options to global session variable
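To illustrate why a single-replica size matters for the sampling threshold, here is a rough sketch under stated assumptions: each tablet is modeled as a list of replica sizes, and "single replica" is taken to mean the average size of one replica per tablet. The class `ReplicaDataSize` and that averaging rule are assumptions for illustration; the actual behavior of `partition.getDataSize(singleReplica)` should be checked against the Doris source.

```java
import java.util.List;

// Hypothetical model: a table is a list of tablets, each tablet a list of replica sizes in bytes.
final class ReplicaDataSize {
    /** Size of one tablet: all replicas summed, or one replica's worth (the average). */
    static long tabletDataSize(List<Long> replicaSizes, boolean singleReplica) {
        long sum = 0;
        for (long size : replicaSizes) {
            sum += size;
        }
        if (!singleReplica || replicaSizes.isEmpty()) {
            return sum;
        }
        return sum / replicaSizes.size();
    }

    /** Size of the whole table under the chosen counting mode. */
    static long tableDataSize(List<List<Long>> tablets, boolean singleReplica) {
        long total = 0;
        for (List<Long> tablet : tablets) {
            total += tabletDataSize(tablet, singleReplica);
        }
        return total;
    }
}
```

With three replicas per tablet, counting every replica reports roughly three times the logical data size, which could push a table over the size threshold even though the data to be analyzed is much smaller.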
ce9c8ed to 18bfa0b
PR approved by at least one committer and no changes requested.
….0 (#25421) This PR is composed of the following commits, which have been merged to Doris master: * #24769 * #24672 * #24599 * #24521 * #24405 * #24237 * #24135 * #24074 * #24026 * #23992 * #23978 * #23622 * #23507 * #23354 * #23103 * #22963 * #22896 * #22775 * #22773 After this PR, when a user upgrades Doris from 2.0.2 to 2.0.3, the original info in AnalysisManager will be ignored, and the new module AnalysisManagerV2 will be saved (with more info).
Proposed changes
Issue Number: close #xxx
Further comments
If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...