[Optimization](statistics) optimize Incremental statistics collection and statistics cleaning #18653

weizhengte · 2023-04-13T14:11:13Z

Proposed changes

This pr mainly optimizes the following items:

the collection of statistics: clear up invalid historical statistics before collecting them, so as not to affect the final table statistics.
the incremental collection of statistics: in the case of incremental collection, only the corresponding partition statistics need to be collected.

TODO: Supports incremental collection of materialized view statistics.

Issue Number: close #xxx

Problem summary

Describe your changes.

Checklist(Required)

Does it affect the original behavior
Has unit tests been added
Has document been added or modified
Does it need to update dependencies
Is this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

weizhengte · 2023-04-13T18:16:28Z

run buildall

hello-stephen · 2023-04-14T01:19:31Z

TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 35.84 seconds
stream load tsv: 440 seconds loaded 74807831229 Bytes, about 162 MB/s
stream load json: 22 seconds loaded 2358488459 Bytes, about 102 MB/s
stream load orc: 61 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230418004344_clickbench_pr_130377.html

fe/fe-core/src/main/java/org/apache/doris/statistics/AnalysisHelper.java

fe/fe-common/src/main/java/org/apache/doris/common/Config.java

fe/fe-core/src/main/java/org/apache/doris/statistics/AnalysisHelper.java

fe/fe-core/src/main/java/org/apache/doris/statistics/StatisticsRepository.java

fe/fe-core/src/main/java/org/apache/doris/statistics/AnalysisManager.java

Kikyou1997

Further modifications and discussions are needed for this PR before it could be merged

weizhengte · 2023-04-16T11:14:31Z

run buildall

weizhengte · 2023-04-16T23:46:05Z

Further modifications and discussions are needed for this PR before it could be merged

I have removed the code that is not part of this pr~

weizhengte · 2023-04-17T17:22:10Z

run buildall

github-actions · 2023-04-18T03:56:58Z

PR approved by anyone and no changes requested.

weizhengte · 2023-04-19T02:42:20Z

Wait until the sampling statistics pr is comerged before modifying

…18880) 1. Supports sampling to collect statistics 2. Improved syntax for collecting statistics 3. Support histogram specifies the number of buckets 4. Tweaked some code structure --- The syntax supports WITH and PROPERTIES, using the same syntax as before. Column Statistics Collection Syntax: ```SQL ANALYZE [ SYNC ] TABLE table_name [ (column_name [, ...]) ] [ [WITH SYNC] | [WITH INCREMENTAL] | [WITH SAMPLE PERCENT | ROWS ] ] [ PROPERTIES ('key' = 'value', ...) ]; ``` Column histogram collection syntax: ```SQL ANALYZE [ SYNC ] TABLE table_name [ (column_name [, ...]) ] UPDATE HISTOGRAM [ [ WITH SYNC ][ WITH INCREMENTAL ][ WITH SAMPLE PERCENT | ROWS ][ WITH BUCKETS ] ] [ PROPERTIES ('key' = 'value', ...) ]; ``` Illustrate： - sync：Collect statistics synchronously. Return after collecting. - incremental：Collect statistics incrementally. Incremental collection of histogram statistics is not supported. - sample percent | rows：Collect statistics by sampling. Scale and number of rows can be sampled. - buckets：Specifies the maximum number of buckets generated when collecting histogram statistics. - table_name: The purpose table for collecting statistics. Can be of the form `db_name.table_name`. - column_name: The specified destination column must be a column that exists in `table_name`, and multiple column names are separated by commas. - properties：Properties used to set statistics tasks. Currently only the following configurations are supported (equivalent to the with statement) - 'sync' = 'true' - 'incremental' = 'true' - 'sample.percent' = '50' - 'sample.rows' = '1000' - 'num.buckets' = 10 --- TODO: - Supplement the complete p0 test - `Incremental` statistics see #18653

…pache#18880) 1. Supports sampling to collect statistics 2. Improved syntax for collecting statistics 3. Support histogram specifies the number of buckets 4. Tweaked some code structure --- The syntax supports WITH and PROPERTIES, using the same syntax as before. Column Statistics Collection Syntax: ```SQL ANALYZE [ SYNC ] TABLE table_name [ (column_name [, ...]) ] [ [WITH SYNC] | [WITH INCREMENTAL] | [WITH SAMPLE PERCENT | ROWS ] ] [ PROPERTIES ('key' = 'value', ...) ]; ``` Column histogram collection syntax: ```SQL ANALYZE [ SYNC ] TABLE table_name [ (column_name [, ...]) ] UPDATE HISTOGRAM [ [ WITH SYNC ][ WITH INCREMENTAL ][ WITH SAMPLE PERCENT | ROWS ][ WITH BUCKETS ] ] [ PROPERTIES ('key' = 'value', ...) ]; ``` Illustrate： - sync：Collect statistics synchronously. Return after collecting. - incremental：Collect statistics incrementally. Incremental collection of histogram statistics is not supported. - sample percent | rows：Collect statistics by sampling. Scale and number of rows can be sampled. - buckets：Specifies the maximum number of buckets generated when collecting histogram statistics. - table_name: The purpose table for collecting statistics. Can be of the form `db_name.table_name`. - column_name: The specified destination column must be a column that exists in `table_name`, and multiple column names are separated by commas. - properties：Properties used to set statistics tasks. Currently only the following configurations are supported (equivalent to the with statement) - 'sync' = 'true' - 'incremental' = 'true' - 'sample.percent' = '50' - 'sample.rows' = '1000' - 'num.buckets' = 10 --- TODO: - Supplement the complete p0 test - `Incremental` statistics see apache#18653

weizhengte · 2023-04-23T03:42:24Z

run buildall

weizhengte · 2023-04-23T04:15:01Z

run clickbench

…pache#18880) 1. Supports sampling to collect statistics 2. Improved syntax for collecting statistics 3. Support histogram specifies the number of buckets 4. Tweaked some code structure --- The syntax supports WITH and PROPERTIES, using the same syntax as before. Column Statistics Collection Syntax: ```SQL ANALYZE [ SYNC ] TABLE table_name [ (column_name [, ...]) ] [ [WITH SYNC] | [WITH INCREMENTAL] | [WITH SAMPLE PERCENT | ROWS ] ] [ PROPERTIES ('key' = 'value', ...) ]; ``` Column histogram collection syntax: ```SQL ANALYZE [ SYNC ] TABLE table_name [ (column_name [, ...]) ] UPDATE HISTOGRAM [ [ WITH SYNC ][ WITH INCREMENTAL ][ WITH SAMPLE PERCENT | ROWS ][ WITH BUCKETS ] ] [ PROPERTIES ('key' = 'value', ...) ]; ``` Illustrate： - sync：Collect statistics synchronously. Return after collecting. - incremental：Collect statistics incrementally. Incremental collection of histogram statistics is not supported. - sample percent | rows：Collect statistics by sampling. Scale and number of rows can be sampled. - buckets：Specifies the maximum number of buckets generated when collecting histogram statistics. - table_name: The purpose table for collecting statistics. Can be of the form `db_name.table_name`. - column_name: The specified destination column must be a column that exists in `table_name`, and multiple column names are separated by commas. - properties：Properties used to set statistics tasks. Currently only the following configurations are supported (equivalent to the with statement) - 'sync' = 'true' - 'incremental' = 'true' - 'sample.percent' = '50' - 'sample.rows' = '1000' - 'num.buckets' = 10 --- TODO: - Supplement the complete p0 test - `Incremental` statistics see apache#18653

weizhengte force-pushed the opt-drop_stats branch from 44ee653 to 1a093b5 Compare April 13, 2023 18:15