From be4fa2af463fe9ed67a899e3c4f256f7feb58112 Mon Sep 17 00:00:00 2001
From: Liuxiaozhen12
Date: Fri, 11 Jun 2021 19:16:35 +0800
Subject: [PATCH 1/3] optimizer: modify docs for analyze behavior

---
 statistics.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/statistics.md b/statistics.md
index 18508e2d64a6a..1fb42701e67c0 100644
--- a/statistics.md
+++ b/statistics.md
@@ -6,7 +6,7 @@ aliases: ['/docs/dev/statistics/','/docs/dev/reference/performance/statistics/']

 # Introduction to Statistics

-TiDB uses statistics to decide [which index to choose](/choose-index.md). The `tidb_analyze_version` variable controls the statistics collected by TiDB. Currently, two versions of statistics are supported: `tidb_analyze_version = 1` (by default) and `tidb_analyze_version = 2`. These two versions include different information in TiDB:
+TiDB uses statistics to decide [which index to choose](/choose-index.md). The `tidb_analyze_version` variable controls the statistics collected by TiDB. Currently, two versions of statistics are supported: `tidb_analyze_version = 1` and `tidb_analyze_version = 2`. In versions before v5.1, the default value of this variable is `1`. In v5.1, the default value of this variable is `2`, which is enabled as an experimental feature. These two versions include different information in TiDB:

 | Information | Version 1 | Version 2|
 | --- | --- | ---|
@@ -22,7 +22,7 @@ TiDB uses statistics to decide [which index to choose](/choose-index.md). The `t
 | The average length of columns | √ | √ |
 | The average length of indexes | √ | √ |

-Compared to Version 1, Version 2 statistics avoids the potential inaccuracy caused by hash collision when the data volume is huge. It also increases the estimate precision in most scenarios.
+Compared to Version 1, Version 2 statistics avoids the potential inaccuracy caused by hash collision when the data volume is huge. It also maintains the estimate precision in most scenarios.

 This document briefly introduces the histogram, Count-Min Sketch, and Top-N, and details the collection and maintenance of statistics.

@@ -62,6 +62,8 @@ You can run the `ANALYZE` statement to collect statistics.
 > For quicker analysis, you can set `tidb_enable_fast_analyze` to `1` to enable the Quick Analysis feature. The default value for this parameter is `0`.
 >
 > After Quick Analysis is enabled, TiDB randomly samples approximately 10,000 rows of data to build statistics. Therefore, in the case of uneven data distribution or a relatively small amount of data, the accuracy of statistical information is relatively poor. It might lead to poor execution plans, such as choosing the wrong index. If the execution time of the normal `ANALYZE` statement is acceptable, it is recommended to disable the Quick Analysis feature.
+>
+> `tidb_enable_fast_analyze` is an experimental feature, which at present **does not match exactly** with the statistical information of `tidb_analyze_version=2`. Therefore, you need to set the value of `tidb_analyze_version` to `1` when enabling `tidb_enable_fast_analyze`.

 #### Full collection

@@ -107,6 +109,10 @@ You can perform full collection using the following syntax.
 ANALYZE TABLE TableName PARTITION PartitionNameList INDEX [IndexNameList] [WITH NUM BUCKETS|TOPN|CMSKETCH DEPTH|CMSKETCH WIDTH|SAMPLES];
 ```

+> **Note:**
+>
+> To ensure the consistency of the statistical information both before and after, when you set `tidb_analyze_version=2`, `ANALYZE TABLE TableName INDEX` will also collect statistics of the whole table instead of the given index.
+
 #### Incremental collection

 To improve the speed of analysis after full collection, incremental collection could be used to analyze the newly added sections in monotonically non-decreasing columns such as time columns.
@@ -270,10 +276,10 @@ Currently, the `SHOW STATS_HISTOGRAMS` statement returns the following 10 column
 | `column_name` | The column name (when `is_index` is `0`) or the index name (when `is_index` is `1`) |
 | `is_index` | Whether it is an index column or not |
 | `update_time` | The time of the update |
-| `version` | The value of `tidb_analyze_version` in the corresponding `ANALYZE` statement |
 | `distinct_count` | The number of different values |
 | `null_count` | The number of `NULL` |
 | `avg_col_size` | The average length of columns |
+| correlation | The Pearson correlation coefficient of the column and the integer primary key, which indicates the degree of association between the two columns|

 ### Buckets of histogram


From fd2594ea3c59f519bf6e1d10f70debfbb2718672 Mon Sep 17 00:00:00 2001
From: Xiaozhen Liu <82579298+Liuxiaozhen12@users.noreply.github.com>
Date: Fri, 18 Jun 2021 11:02:46 +0800
Subject: [PATCH 2/3] Apply suggestions from code review

Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com>
---
 statistics.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/statistics.md b/statistics.md
index 1fb42701e67c0..44e5d345f62e4 100644
--- a/statistics.md
+++ b/statistics.md
@@ -6,7 +6,7 @@ aliases: ['/docs/dev/statistics/','/docs/dev/reference/performance/statistics/']

 # Introduction to Statistics

-TiDB uses statistics to decide [which index to choose](/choose-index.md). The `tidb_analyze_version` variable controls the statistics collected by TiDB. Currently, two versions of statistics are supported: `tidb_analyze_version = 1` and `tidb_analyze_version = 2`. In versions before v5.1, the default value of this variable is `1`. In v5.1, the default value of this variable is `2`, which is enabled as an experimental feature. These two versions include different information in TiDB:
+TiDB uses statistics to decide [which index to choose](/choose-index.md). The `tidb_analyze_version` variable controls the statistics collected by TiDB. Currently, two versions of statistics are supported: `tidb_analyze_version = 1` and `tidb_analyze_version = 2`. In versions before v5.1, the default value of this variable is `1`. In v5.1, the default value of this variable is `2`, which serves as an experimental feature. These two versions include different information in TiDB:

 | Information | Version 1 | Version 2|
 | --- | --- | ---|
@@ -63,7 +63,7 @@ You can run the `ANALYZE` statement to collect statistics.
 >
 > After Quick Analysis is enabled, TiDB randomly samples approximately 10,000 rows of data to build statistics. Therefore, in the case of uneven data distribution or a relatively small amount of data, the accuracy of statistical information is relatively poor. It might lead to poor execution plans, such as choosing the wrong index. If the execution time of the normal `ANALYZE` statement is acceptable, it is recommended to disable the Quick Analysis feature.
 >
-> `tidb_enable_fast_analyze` is an experimental feature, which at present **does not match exactly** with the statistical information of `tidb_analyze_version=2`. Therefore, you need to set the value of `tidb_analyze_version` to `1` when enabling `tidb_enable_fast_analyze`.
+> `tidb_enable_fast_analyze` is an experimental feature, which currently **does not match exactly** with the statistical information of `tidb_analyze_version=2`. Therefore, you need to set the value of `tidb_analyze_version` to `1` when `tidb_enable_fast_analyze` is enabled.

 #### Full collection


From d67830e416c95d0289c9eababc13f5ed38458e35 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 18 Jun 2021 14:21:41 +0800
Subject: [PATCH 3/3] Update statistics.md

---
 statistics.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/statistics.md b/statistics.md
index 44e5d345f62e4..cd7f8ed0d17f3 100644
--- a/statistics.md
+++ b/statistics.md
@@ -111,7 +111,7 @@ You can perform full collection using the following syntax.

 > **Note:**
 >
-> To ensure the consistency of the statistical information both before and after, when you set `tidb_analyze_version=2`, `ANALYZE TABLE TableName INDEX` will also collect statistics of the whole table instead of the given index.
+> To ensure that the statistical information before and after the collection is consistent, when you set `tidb_analyze_version=2`, `ANALYZE TABLE TableName INDEX` will also collect statistics of the whole table instead of the given index.

 #### Incremental collection
