From dc242b11601f4edb2f1adaceb97050ba0728af79 Mon Sep 17 00:00:00 2001
From: Yiding Cui
Date: Tue, 9 Aug 2022 14:53:01 +0800
Subject: [PATCH 01/24] statistics: add some doc for the exp feature

---
 system-variables.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/system-variables.md b/system-variables.md
index 05d64efdce163..760a6facf53ee 100644
--- a/system-variables.md
+++ b/system-variables.md
@@ -905,6 +905,13 @@ Constraint checking is always performed in place for pessimistic transactions (d
 - `RESTRICTED_VARIABLES_ADMIN`: The ability to see and set sensitive variables in `SHOW [GLOBAL] VARIABLES` and `SET`.
 - `RESTRICTED_USER_ADMIN`: The ability to prevent other users from making changes or dropping a user account.
 
+### tidb_enable_extended_stats
+
+- Scope: SESSION | GLOBAL
+- Type: Boolean
+- Default value: `OFF`
+- This variable indicates whether TiDB can collect the extended statistic to guide the optimizer. Refer to the chapter [Introduction to Extended Statistics](./extended-statistics.md) for more information.
+
 ### tidb_enable_fast_analyze
 
 > **Warning:**

From fd59fb23ccea4fa4ad514d7dd10d1db8fe0d6981 Mon Sep 17 00:00:00 2001
From: Yiding Cui
Date: Tue, 9 Aug 2022 14:58:37 +0800
Subject: [PATCH 02/24] add the new file

---
 extended-statistics.md | 114 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 extended-statistics.md

diff --git a/extended-statistics.md b/extended-statistics.md
new file mode 100644
index 0000000000000..7875a73ab36d3
--- /dev/null
+++ b/extended-statistics.md
@@ -0,0 +1,114 @@
+---
+title: Introduction to Extended Statistics
+summary: Learn how to use extended statistics to guide the optimizer.
+---
+
+# Introduction to Extended Statistics
+
+The statistics mentioned in the [Introduction to Statistics](/statistics.md) section, including histograms and Count-Min Sketch, are regular statistics. This information is collected each time statistics are collected manually or automatically. Another class of statistics, as opposed to common statistics, is extended statistics, which are only helpful for optimizer estimation in a specific scenario.
+
+Since they are only helpful in specific scenarios, extended statistics are not collected during the default manual or automatic `ANALYZE` to avoid the overhead of managing statistics. If you want to collect extended statistics, you need to "register" them with SQL commands first. Then TiDB will collect these registered extended statistics in addition to the regular statistics the next time you manually or automatically `ANALYZE`.
+
+# The registration of the Extended Statistics
+
+If you want to register the extended statistics, you can use the SQL `ALTER TABLE ADD STATS_EXTENDED`. The grammar is shown below:
+
+{{< copyable "sql" >}}
+
+```sql
+ALTER TABLE table_name ADD STATS_EXTENDED IF NOT EXISTS stats_name stats_type(column_name, column_name...);
+```
+
+This statement indicates that you want to collect the specified type of extended statistics on the specified columns of the table and name it.
+
+- `table_name` is the table that you want to collect the extended statistics.
+- `stats_name` is the name of the extended statistics. It should be unique for each table.
+- `stats_type` is the type of the extended statistics. Now it only has one possible value `correlation`.
+- `column_name` specifies the column group. It can be multiple columns. For `correlation` type, there should be and only be two columns.
+
+The extended statistics will be collected if the `mysql.stats_extended` has the corresponding record when we run the `ANALYZE` command. And the `status` column will be set to `1`, and the `version`` column will be set to the new timestamp.
+
+## The type of the Extended Statistics
+
+### Correlation
+
+This is the only supported type of extended statistics. The registration SQL is like the following:
+
+{{< copyable "sql" >}}
+
+```sql
+ALTER TABLE t ADD STATS_EXTENDED s1 correlation(col1, col2);
+```
+
+When we run the `ANALYZE` after the registration, TiDB will calculate the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of the `col` and `col2` of the table `t` and write the record into the table `mysql.stats_extended`.
+
+It's used to improve TiDB's index selection for the following scenario:
+
+For a table `t` described below:
+
+{{< copyable "sql" >}}
+
+```sql
+CREATE TABLE t(col1 INT, col2 INT, KEY(col1), KEY(col2));
+```
+
+Suppose that the `col1` and `col2` of the table `t` both obey monotonically increasing constraints in row order, i.e., the values of `col1` and `col2` are strictly correlated in order (correlation value of 1):
+
+{{< copyable "sql" >}}
+
+```sql
+SELECT * FROM t WHERE col1 > 1 ORDER BY col2 LIMIT 1;
+```
+
+For the above query, the optimizer has two choices to access the table `t`: one uses the index on `col1` to access the table and then sorts the result by `col2` to calculate the `Top-1`. Another is that access the table by index on `col2` to meet the first row that satisfies `col1 > 1`. The latter's cost mainly depends on how many rows are filtered out when we scan the table in `col2`'s order. Usually, the optimizer can only suppose that `col1` and `col2` are independent, leading to a significant estimation error.
+
+After the TiDB has the extended statistics for correlation, the optimizer can estimate how many rows we need to scan more precisely. Since the `col1` and `col2` are strictly correlated in order, the optimizer will equivalently translate the row count estimate for option two above into:
+
+{{< copyable "sql" >}}
+
+```sql
+SELECT * FROM t WHERE col1 <= 1 OR col1 IS NULL;
+```
+
+The above estimation plus one will be the final estimation for the condition. This way, we don't need to use the independent assumption to get a significant estimation error.
+The optimizer will use the independent assumption if the correlation factor is less than the system variable `tidb_opt_correlation_threshold`. But it will increase the estimation heuristically. The larger the system variable `tidb_opt_correlation_exp_factor` is, the larger the estimation result is. The larger the absolute value of the correlation factor is, the larger the estimation result is.
+
+## The collection of the Extended Statistics
+After registration, the TiDB will collect the extended statistic after triggering the ANALYZE command manually or automatically. TiDB will not collect the extended statistics if we only collect the indexes' statistics. It also will not collect it if it's `ANALYZE INCREMENTAL` or the `tidb_enable_fast_analyze` is true.
+
+## The cache of the Extended Statistics
+
+Each TiDB node will maintain a cache for the extended statistics to improve the efficiency of visiting the extended statistics. TiDB will load the table `mysql.`stats_extended` periodically to ensure that the cache is kept the same as the data in the table. Each row in the table `mysql.stats_extended` records a column `version`. Once the row is updated, the value of the column `version` will be increased so that we can load the table into the memory incrementally instead of a full loading.
+To delete a record of the extended statistics, TiDB provides the following command:
+
+{{< copyable "sql" >}}
+
+```sql
+ALTER TABLE table_name DROP STATS_EXTENDED stats_name;
+```
+
+This command will mark the value of the corresponding record in the table `mysql.stats_extended`'s column `status` to `2`(meaning that the record is deleted) instead of deleting the record directly. Other TiDBs will read this change and delete the record in their memory cache. The background garbage collection will delete the record eventually.
+
+Don't operate the table `mysql.stats_extended` directly. This can cause the inconsistency of the cache of each TiDB node. If you do such an operation wrongly, you can use the following command to load the data of the table fully instead of incrementally:
+
+{{< copyable "sql" >}}
+
+```sql
+ADMIN RELOAD STATS_EXTENDED;
+```
+
+## The dump and load of the Extended Statistics
+
+The way mentioned in the chapter [Introduction to Statistics](/statistics.md) is also suitable for extended statistics. The dump result is in the same JSON file as the normal statistics.
+
+## The switch
+
+You can use the following command to enable the feature:
+
+{{< copyable "sql" >}}
+
+```sql
+set global tidb_enable_extended_stats = on;
+```
+
+The default value of `tidb_enable_extended_stats` is `off`.

From 70200046e68c9e78e626c0459e321ce55ad73443 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Tue, 9 Aug 2022 15:24:08 +0800
Subject: [PATCH 03/24] Update extended-statistics.md

---
 extended-statistics.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/extended-statistics.md b/extended-statistics.md
index 7875a73ab36d3..ca942c23005cb 100644
--- a/extended-statistics.md
+++ b/extended-statistics.md
@@ -26,7 +26,7 @@ This statement indicates that you want to collect the specified type of extended
 - `stats_type` is the type of the extended statistics. Now it only has one possible value `correlation`.
 - `column_name` specifies the column group. It can be multiple columns. For `correlation` type, there should be and only be two columns.
 
-The extended statistics will be collected if the `mysql.stats_extended` has the corresponding record when we run the `ANALYZE` command. And the `status` column will be set to `1`, and the `version`` column will be set to the new timestamp.
+The extended statistics will be collected if the `mysql.stats_extended` has the corresponding record when we run the `ANALYZE` command. And the `status` column will be set to `1`, and the `version` column will be set to the new timestamp.
 
 ## The type of the Extended Statistics
 

From 36c4e78260f2024967e99084740fde0e3c669220 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Tue, 9 Aug 2022 15:24:40 +0800
Subject: [PATCH 04/24] Update extended-statistics.md

---
 extended-statistics.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/extended-statistics.md b/extended-statistics.md
index ca942c23005cb..5f9d99ae4aeb5 100644
--- a/extended-statistics.md
+++ b/extended-statistics.md
@@ -78,7 +78,7 @@ After registration, the TiDB will collect the extended statistic after triggerin
 
 ## The cache of the Extended Statistics
 
-Each TiDB node will maintain a cache for the extended statistics to improve the efficiency of visiting the extended statistics. TiDB will load the table `mysql.`stats_extended` periodically to ensure that the cache is kept the same as the data in the table. Each row in the table `mysql.stats_extended` records a column `version`. Once the row is updated, the value of the column `version` will be increased so that we can load the table into the memory incrementally instead of a full loading.
+Each TiDB node will maintain a cache for the extended statistics to improve the efficiency of visiting the extended statistics. TiDB will load the table `mysql.stats_extended` periodically to ensure that the cache is kept the same as the data in the table. Each row in the table `mysql.stats_extended` records a column `version`. Once the row is updated, the value of the column `version` will be increased so that we can load the table into the memory incrementally instead of a full loading.
 To delete a record of the extended statistics, TiDB provides the following command:
 
 {{< copyable "sql" >}}

From 95e424c4b75ac89457632c65bceaa7f9eb60f375 Mon Sep 17 00:00:00 2001
From: Yiding Cui
Date: Thu, 18 Aug 2022 02:10:56 +0800
Subject: [PATCH 05/24] address comments

---
 extended-statistics.md | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/extended-statistics.md b/extended-statistics.md
index 5f9d99ae4aeb5..adbf627d28c4c 100644
--- a/extended-statistics.md
+++ b/extended-statistics.md
@@ -5,9 +5,9 @@ summary: Learn how to use extended statistics to guide the optimizer.
 
 # Introduction to Extended Statistics
 
-The statistics mentioned in the [Introduction to Statistics](/statistics.md) section, including histograms and Count-Min Sketch, are regular statistics. This information is collected each time statistics are collected manually or automatically. Another class of statistics, as opposed to common statistics, is extended statistics, which are only helpful for optimizer estimation in a specific scenario.
+The statistics mentioned in the [Introduction to Statistics](/statistics.md) section, including histograms and Count-Min Sketch, are common statistics. This information is collected each time statistics are collected manually or automatically. Another class of statistics, as opposed to common statistics, is extended statistics, which are only helpful for optimizer estimation in a specific scenario.
 
-Since they are only helpful in specific scenarios, extended statistics are not collected during the default manual or automatic `ANALYZE` to avoid the overhead of managing statistics. If you want to collect extended statistics, you need to "register" them with SQL commands first. Then TiDB will collect these registered extended statistics in addition to the regular statistics the next time you manually or automatically `ANALYZE`.
+Since they are only helpful in specific scenarios, extended statistics are not collected during the default manual or automatic `ANALYZE` to avoid the overhead of managing statistics. If you want to collect extended statistics, you need to "register" them with SQL commands first. Then TiDB will collect these registered extended statistics in addition to the common statistics the next time you manually or automatically `ANALYZE`.
 
 # The registration of the Extended Statistics
 
@@ -32,7 +32,7 @@ The extended statistics will be collected if the `mysql.stats_extended` has the
 
 ### Correlation
 
-This is the only supported type of extended statistics. The registration SQL is like the following:
+The registration SQL is like the following:
 
 {{< copyable "sql" >}}
 
@@ -52,7 +52,7 @@ For a table `t` described below:
 CREATE TABLE t(col1 INT, col2 INT, KEY(col1), KEY(col2));
 ```
 
-Suppose that the `col1` and `col2` of the table `t` both obey monotonically increasing constraints in row order, i.e., the values of `col1` and `col2` are strictly correlated in order (correlation value of 1):
+Suppose that the `col1` and `col2` of the table `t` both obey monotonically increasing constraints in row order, i.e., the values of `col1` and `col2` are strictly correlated in order (the value of the correlation is 1):
 
 {{< copyable "sql" >}}
 
@@ -74,11 +74,17 @@ The above estimation plus one will be the final estimation for the condition. Th
 The optimizer will use the independent assumption if the correlation factor is less than the system variable `tidb_opt_correlation_threshold`. But it will increase the estimation heuristically. The larger the system variable `tidb_opt_correlation_exp_factor` is, the larger the estimation result is. The larger the absolute value of the correlation factor is, the larger the estimation result is.
 
 ## The collection of the Extended Statistics
-After registration, the TiDB will collect the extended statistic after triggering the ANALYZE command manually or automatically. TiDB will not collect the extended statistics if we only collect the indexes' statistics. It also will not collect it if it's `ANALYZE INCREMENTAL` or the `tidb_enable_fast_analyze` is true.
-## The cache of the Extended Statistics
+
+After registration, TiDB collects the extended statistic with the ANALYZE command manually or automatically, except below scenarios:
+
+- Statistics collection on indexes only
+- Statistics collection with `ANALYZE INCREMENTAL` command
+- Statistics collection with variable `tidb_enable_fast_analyze` is true
+
+## The deletion of the Extended Statistics
 
 Each TiDB node will maintain a cache for the extended statistics to improve the efficiency of visiting the extended statistics. TiDB will load the table `mysql.stats_extended` periodically to ensure that the cache is kept the same as the data in the table. Each row in the table `mysql.stats_extended` records a column `version`. Once the row is updated, the value of the column `version` will be increased so that we can load the table into the memory incrementally instead of a full loading.
+
 To delete a record of the extended statistics, TiDB provides the following command:
 
 {{< copyable "sql" >}}

From 7546e8d7b2583cae8070cbf76f457c828d25b5ff Mon Sep 17 00:00:00 2001
From: Yiding Cui
Date: Tue, 30 Aug 2022 00:38:49 +0800
Subject: [PATCH 06/24] address comments

---
 extended-statistics.md | 91 ++++++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 43 deletions(-)

diff --git a/extended-statistics.md b/extended-statistics.md
index adbf627d28c4c..95ce22c825af7 100644
--- a/extended-statistics.md
+++ b/extended-statistics.md
@@ -9,7 +9,21 @@ The statistics mentioned in the [Introduction to Statistics](/statistics.md) sec
 
 Since they are only helpful in specific scenarios, extended statistics are not collected during the default manual or automatic `ANALYZE` to avoid the overhead of managing statistics. If you want to collect extended statistics, you need to "register" them with SQL commands first. Then TiDB will collect these registered extended statistics in addition to the common statistics the next time you manually or automatically `ANALYZE`.
 
-# The registration of the Extended Statistics
+## How to enable the Extended Statistics
+
+You can use the following command to enable the feature:
+
+{{< copyable "sql" >}}
+
+```sql
+set global tidb_enable_extended_stats = on;
+```
+
+The default value of `tidb_enable_extended_stats` is `off`.
+
+## SQL Grammar
+
+### The registration of the Extended Statistics
 
 If you want to register the extended statistics, you can use the SQL `ALTER TABLE ADD STATS_EXTENDED`. The grammar is shown below:
 
@@ -28,6 +42,39 @@ This statement indicates that you want to collect the specified type of extended
 
 The extended statistics will be collected if the `mysql.stats_extended` has the corresponding record when we run the `ANALYZE` command. And the `status` column will be set to `1`, and the `version` column will be set to the new timestamp.
 
+### The deletion of the Extended Statistics
+
+Each TiDB node will maintain a cache for the extended statistics to improve the efficiency of visiting the extended statistics. TiDB will load the table `mysql.stats_extended` periodically to ensure that the cache is kept the same as the data in the table. Each row in the table `mysql.stats_extended` records a column `version`. Once the row is updated, the value of the column `version` will be increased so that we can load the table into the memory incrementally instead of a full loading.
+
+To delete a record of the extended statistics, TiDB provides the following command:
+
+{{< copyable "sql" >}}
+
+```sql
+ALTER TABLE table_name DROP STATS_EXTENDED stats_name;
+```
+
+This command will mark the value of the corresponding record in the table `mysql.stats_extended`'s column `status` to `2`(meaning that the record is deleted) instead of deleting the record directly. Other TiDBs will read this change and delete the record in their memory cache. The background garbage collection will delete the record eventually.
+
+### Flush the cache of one TiDB node
+
+We don't suggest you directly operate on the table `mysql.stats_extended`. The direct operation on the table would not manifest in the cache, which may cause the inconsistency of the cache on the different TiDB nodes.
+
+If you do such an operation wrongly, you can use the following command on each TiDB node to load the data of the table fully instead of incrementally:
+
+{{< copyable "sql" >}}
+
+```sql
+ADMIN RELOAD STATS_EXTENDED;
+```
+### Collecting the Extended Statistics
+
+After registration, TiDB collects the extended statistic with the `ANALYZE` command manually or automatically, except below scenarios:
+
+- Statistics collection on indexes only
+- Statistics collection with `ANALYZE INCREMENTAL` command
+- Statistics collection with variable `tidb_enable_fast_analyze` is true
+
 ## The type of the Extended Statistics
 
 ### Correlation
@@ -73,48 +120,6 @@ SELECT * FROM t WHERE col1 <= 1 OR col1 IS NULL;
 The above estimation plus one will be the final estimation for the condition. This way, we don't need to use the independent assumption to get a significant estimation error.
 The optimizer will use the independent assumption if the correlation factor is less than the system variable `tidb_opt_correlation_threshold`. But it will increase the estimation heuristically. The larger the system variable `tidb_opt_correlation_exp_factor` is, the larger the estimation result is. The larger the absolute value of the correlation factor is, the larger the estimation result is.
 
-## The collection of the Extended Statistics
-
-After registration, TiDB collects the extended statistic with the ANALYZE command manually or automatically, except below scenarios:
-
-- Statistics collection on indexes only
-- Statistics collection with `ANALYZE INCREMENTAL` command
-- Statistics collection with variable `tidb_enable_fast_analyze` is true
-
-## The deletion of the Extended Statistics
-
-Each TiDB node will maintain a cache for the extended statistics to improve the efficiency of visiting the extended statistics. TiDB will load the table `mysql.stats_extended` periodically to ensure that the cache is kept the same as the data in the table. Each row in the table `mysql.stats_extended` records a column `version`. Once the row is updated, the value of the column `version` will be increased so that we can load the table into the memory incrementally instead of a full loading.
-
-To delete a record of the extended statistics, TiDB provides the following command:
-
-{{< copyable "sql" >}}
-
-```sql
-ALTER TABLE table_name DROP STATS_EXTENDED stats_name;
-```
-
-This command will mark the value of the corresponding record in the table `mysql.stats_extended`'s column `status` to `2`(meaning that the record is deleted) instead of deleting the record directly. Other TiDBs will read this change and delete the record in their memory cache. The background garbage collection will delete the record eventually.
-
-Don't operate the table `mysql.stats_extended` directly. This can cause the inconsistency of the cache of each TiDB node. If you do such an operation wrongly, you can use the following command to load the data of the table fully instead of incrementally:
-
-{{< copyable "sql" >}}
-
-```sql
-ADMIN RELOAD STATS_EXTENDED;
-```
-
 ## The dump and load of the Extended Statistics
 
 The way mentioned in the chapter [Introduction to Statistics](/statistics.md) is also suitable for extended statistics. The dump result is in the same JSON file as the normal statistics.
-
-## The switch
-
-You can use the following command to enable the feature:
-
-{{< copyable "sql" >}}
-
-```sql
-set global tidb_enable_extended_stats = on;
-```
-
-The default value of `tidb_enable_extended_stats` is `off`.
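
Reviewer's sketch, not part of the series: the workflow that patches 01 through 06 document could be exercised roughly as follows. It assumes the table `t` from the patch's own correlation example. The session-level `SET` is an addition based on the `SESSION | GLOBAL` scope listed in patch 01, since a `SET GLOBAL` normally does not change the value in the session that issues it. The `SELECT` is only a read-only check of the `status` and `version` columns that the patch text describes.

```sql
/* Enable the feature for new sessions and for the current one. */
SET GLOBAL tidb_enable_extended_stats = ON;
SET SESSION tidb_enable_extended_stats = ON;

/* Table taken from the patch's correlation example. */
CREATE TABLE t(col1 INT, col2 INT, KEY(col1), KEY(col2));

/* Register, then collect with a full ANALYZE (not index-only or INCREMENTAL). */
ALTER TABLE t ADD STATS_EXTENDED s1 correlation(col1, col2);
ANALYZE TABLE t;

/* Read-only check; per the patch text, status is 1 once collected. */
SELECT status, version FROM mysql.stats_extended;

/* Drop the registration; per the patch text, status becomes 2 until garbage collection removes the row. */
ALTER TABLE t DROP STATS_EXTENDED s1;
```

Reading the system table is harmless here; the warning in the patch concerns modifying `mysql.stats_extended` directly.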
From b0f419388edfe3eb4713042691dcf354f13bf161 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Wed, 7 Sep 2022 19:04:55 +0800
Subject: [PATCH 07/24] refine docs batch 1

---
 extended-statistics.md | 80 +++++++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 37 deletions(-)

diff --git a/extended-statistics.md b/extended-statistics.md
index 95ce22c825af7..1947ca4a46df2 100644
--- a/extended-statistics.md
+++ b/extended-statistics.md
@@ -5,79 +5,87 @@ summary: Learn how to use extended statistics to guide the optimizer.
 
 # Introduction to Extended Statistics
 
-The statistics mentioned in the [Introduction to Statistics](/statistics.md) section, including histograms and Count-Min Sketch, are common statistics. This information is collected each time statistics are collected manually or automatically. Another class of statistics, as opposed to common statistics, is extended statistics, which are only helpful for optimizer estimation in a specific scenario.
+TiDB can collect the following two types of statistics:
 
-Since they are only helpful in specific scenarios, extended statistics are not collected during the default manual or automatic `ANALYZE` to avoid the overhead of managing statistics. If you want to collect extended statistics, you need to "register" them with SQL commands first. Then TiDB will collect these registered extended statistics in addition to the common statistics the next time you manually or automatically `ANALYZE`.
+- Regular statistics: statistics such as histograms and Count-Min Sketch. See [Introduction to Statistics](/statistics.md) for details.
+- Extended statistics: statistics filtered by tables and columns.
 
-## How to enable the Extended Statistics
+Because the extended statistics are only used for optimizer estimates in specific scenarios, when the `ANALYZE` statement is executed manually or automatically, to reduce the overhead of managing statistics, TiDB only collects the regular statistics and does not collect the extended statistics by default.
 
-You can use the following command to enable the feature:
+Extended statistics are disabled by default. To collect extended statistics, you need to enable and register the extended statistics first.
 
-{{< copyable "sql" >}}
+After the registration, the next time the `ANALYZE` statement is executed manually or automatically, TiDB collects both the regular statistics and the registered extended statistics.
 
-```sql
-set global tidb_enable_extended_stats = on;
-```
+## Limitations
+
+Extended statistics are not collected in the following scenarios:
 
-The default value of `tidb_enable_extended_stats` is `off`.
+- Statistics collection on indexes only
+- Statistics collection with the `ANALYZE INCREMENTAL` command
+- Statistics collection with the value of the `tidb_enable_fast_analyze` system variable set to `true`
 
-## SQL Grammar
+## Common operations
 
-### The registration of the Extended Statistics
+### Enable extended statistics
 
-If you want to register the extended statistics, you can use the SQL `ALTER TABLE ADD STATS_EXTENDED`. The grammar is shown below:
+To enable extended statistics, set the system variable `tidb_enable_extended_stats` to `ON`:
 
-{{< copyable "sql" >}}
+```sql
+SET GLOBAL tidb_enable_extended_stats = ON;
+```
+
+The default value of this variable is `OFF`.
+
+### Register extended statistics
+
+To register the extended statistics, use the SQL statement `ALTER TABLE ADD STATS_EXTENDED`. The syntax is as follows:
 
 ```sql
 ALTER TABLE table_name ADD STATS_EXTENDED IF NOT EXISTS stats_name stats_type(column_name, column_name...);
 ```
 
-This statement indicates that you want to collect the specified type of extended statistics on the specified columns of the table and name it.
+In the statement, you can specify the table name, statistics type, statistics type, and column name of the extended statistics to be collected.
 
-- `table_name` is the table that you want to collect the extended statistics.
-- `stats_name` is the name of the extended statistics. It should be unique for each table.
-- `stats_type` is the type of the extended statistics. Now it only has one possible value `correlation`.
-- `column_name` specifies the column group. It can be multiple columns. For `correlation` type, there should be and only be two columns.
+- `table_name` specifies the name of the table from which the extended statistics are collected.
+- `stats_name` specifies the name of the statistics, which must be unique for each table.
+- `stats_type` specifies the type of the statistics. Currently, only the correlation type is supported.
+- `column_name` specifies the column group, which might have multiple columns. Currently, you can only specify two column names.
 
-The extended statistics will be collected if the `mysql.stats_extended` has the corresponding record when we run the `ANALYZE` command. And the `status` column will be set to `1`, and the `version` column will be set to the new timestamp.
+Each TiDB node maintains a cache in the system table `mysql.stats_extended` for extended statistics, which improve access performance. After you register the extended statistics, if the system table `mysql.stats_extended` has the corresponding records, the next time the `ANALYZE` statement is executed, TiDB will collect the extended statistics.
 
-### The deletion of the Extended Statistics
+Each row in the `mysql.stats_extended` table records a `version` column. Once a row is updated, the value of the column `version` is increased, so that TiDB can load the table into the memory incrementally instead of fully.
 
-Each TiDB node will maintain a cache for the extended statistics to improve the efficiency of visiting the extended statistics. TiDB will load the table `mysql.stats_extended` periodically to ensure that the cache is kept the same as the data in the table. Each row in the table `mysql.stats_extended` records a column `version`. Once the row is updated, the value of the column `version` will be increased so that we can load the table into the memory incrementally instead of a full loading.
+TiDB loads `mysql.stats_extended` periodically to ensure that the cache is kept the same as the data in the table.
 
-To delete a record of the extended statistics, TiDB provides the following command:
+### Delete extended statistics
 
-{{< copyable "sql" >}}
+To delete a record of the extended statistics, use the following statement:
 
 ```sql
 ALTER TABLE table_name DROP STATS_EXTENDED stats_name;
 ```
 
-This command will mark the value of the corresponding record in the table `mysql.stats_extended`'s column `status` to `2`(meaning that the record is deleted) instead of deleting the record directly. Other TiDBs will read this change and delete the record in their memory cache. The background garbage collection will delete the record eventually.
+After you execute the statement, TiDB marks the value of the corresponding record in `mysql.stats_extended`'s column `status` to `2`, which means that the record is deleted, instead of deleting the record directly.
 
-### Flush the cache of one TiDB node
+Other TiDB nodes will read this change and delete the record in their memory cache. The background garbage collection will delete the record eventually.
 
-We don't suggest you directly operate on the table `mysql.stats_extended`. The direct operation on the table would not manifest in the cache, which may cause the inconsistency of the cache on the different TiDB nodes.
+### Flush the cache of one TiDB node
 
-If you do such an operation wrongly, you can use the following command on each TiDB node to load the data of the table fully instead of incrementally:
+It is not recommended to directly operate on the `mysql.stats_extended` system table. The direct operation on the table causes inconsistent caches on different TiDB nodes.
 
-{{< copyable "sql" >}}
+If you have mistakenly operated on the table, you can use the following statement on each TiDB node. Then the current cache will be cleared and the `mysql.stats_extended` table will be fully reloaded:
 
 ```sql
 ADMIN RELOAD STATS_EXTENDED;
 ```
-### Collecting the Extended Statistics
 
-After registration, TiDB collects the extended statistic with the `ANALYZE` command manually or automatically, except below scenarios:
+## Export and import extended statistics
 
-- Statistics collection on indexes only
-- Statistics collection with `ANALYZE INCREMENTAL` command
-- Statistics collection with variable `tidb_enable_fast_analyze` is true
+The way of exporting or importing extended statistics is the same as the regular statistics. See [Introduction to Statistics - Import and export statistics](/statistics.md#import-and-export-statistics) for details.
 
-## The type of the Extended Statistics
+## Usage scenarios and examples
 
-### Correlation
+Currently, only the correlation type is supported. This type is used to estimate the number of rows in the range query. The following example shows how to use the correlation type to estimate the number of rows in the range query.
 
 The registration SQL is like the following:
 
@@ -120,6 +128,4 @@ SELECT * FROM t WHERE col1 <= 1 OR col1 IS NULL;
 The above estimation plus one will be the final estimation for the condition. This way, we don't need to use the independent assumption to get a significant estimation error.
The optimizer will use the independent assumption if the correlation factor is less than the system variable `tidb_opt_correlation_threshold`. But it will increase the estimation heuristically. The larger the system variable `tidb_opt_correlation_exp_factor` is, the larger the estimation result is. The larger the absolute value of the correlation factor is, the larger the estimation result is. -## The dump and load of the Extended Statistics -The way mentioned in the chapter [Introduction to Statistics](/statistics.md) is also suitable for extended statistics. The dump result is in the same JSON file as the normal statistics. From a6a88d83eaa6fb1a913361df2731352be3f04628 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 8 Sep 2022 18:10:32 +0800 Subject: [PATCH 08/24] add details summary display --- extended-statistics.md | 61 +++++++++++++++++++++++++++--------------- 1 file changed, 40 insertions(+), 21 deletions(-) diff --git a/extended-statistics.md b/extended-statistics.md index 1947ca4a46df2..abd6198cdbd09 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -10,11 +10,11 @@ TiDB can collect the following two types of statistics: - Regular statistics: statistics such as histograms and Count-Min Sketch. See [Introduction to Statistics](/statistics.md) for details. - Extended statistics: statistics filtered by tables and columns. -Because the extended statistics are only used for optimizer estimates in specific scenarios, when the `ANALYZE` statement is executed manually or automatically, to reduce the overhead of managing statistics, TiDB only collects the regular statistics and does not collect the extended statistics by default. +When the `ANALYZE` statement is executed manually or automatically, TiDB by default only collects the regular statistics and does not collect the extended statistics. 
This is because the extended statistics are only used for optimizer estimates in specific scenarios, and collecting them requires additional overhead. -Extended statistics are disabled by default. To collect extended statistics, you need to enable and register the extended statistics first. +Extended statistics are disabled by default. To collect extended statistics, you need to first enable and register the extended statistics. -After the registration, the next time the `ANALYZE` statement is executed manually or automatically, TiDB collects both the regular statistics and the registered extended statistics. +After the registration, the next time the `ANALYZE` statement is executed, TiDB collects both the regular statistics and the registered extended statistics. ## Limitations @@ -22,7 +22,7 @@ Extended statistics are not collected in the following scenarios: - Statistics collection on indexes only - Statistics collection with the `ANALYZE INCREMENTAL` command -- Statistics collection with the value of the `tidb_enable_fast_analyze` system variable set to `true` +- Statistics collection when the value of the system variable `tidb_enable_fast_analyze` is set to `true` ## Common operations @@ -38,25 +38,40 @@ The default value of this variable is `OFF`. ### Register extended statistics -To register the extended statistics, use the SQL statement `ALTER TABLE ADD STATS_EXTENDED`. The syntax is as follows: +To register extended statistics, use the SQL statement `ALTER TABLE ADD STATS_EXTENDED`. The syntax is as follows: ```sql ALTER TABLE table_name ADD STATS_EXTENDED IF NOT EXISTS stats_name stats_type(column_name, column_name...); ``` -In the statement, you can specify the table name, statistics type, statistics type, and column name of the extended statistics to be collected. +In the statement, you can specify the table name, statistics name, statistics type, and column name of the extended statistics to be collected. 
- `table_name` specifies the name of the table from which the extended statistics are collected. - `stats_name` specifies the name of the statistics, which must be unique for each table. - `stats_type` specifies the type of the statistics. Currently, only the correlation type is supported. - `column_name` specifies the column group, which might have multiple columns. Currently, you can only specify two column names. -Each TiDB node maintains a cache in the system table `mysql.stats_extended` for extended statistics, which improve access performance. After you register the extended statistics, if the system table `mysql.stats_extended` has the corresponding records, the next time the `ANALYZE` statement is executed, TiDB will collect the extended statistics. +
+ How it works -Each row in the `mysql.stats_extended` table records a `version` column. Once a row is updated, the value of the column `version` is increased, so that TiDB can load the table into the memory incrementally instead of fully. +To improve access performance, each TiDB node maintains a cache in the system table `mysql.stats_extended` for extended statistics. After you register the extended statistics, the next time the `ANALYZE` statement is executed, TiDB will collect the extended statistics if the system table `mysql.stats_extended` has the corresponding records. + +Each row in the `mysql.stats_extended` table has a `version` column. Once a row is updated, the value of `version` is increased. In this way, TiDB loads the table into memory incrementally, instead of fully. TiDB loads `mysql.stats_extended` periodically to ensure that the cache is kept the same as the data in the table. +> **Warning:** +> +> It is **NOT RECOMMENDED** to directly operate on the `mysql.stats_extended` system table. Otherwise, inconsistent caches occur on different TiDB nodes. +> +> If you have mistakenly operated on the table, you can use the following statement on each TiDB node. Then the current cache will be cleared and the `mysql.stats_extended` table will be fully reloaded: +> +> ```sql +> ADMIN RELOAD STATS_EXTENDED; +> ``` + +
+ ### Delete extended statistics To delete a record of the extended statistics, use the following statement: @@ -65,31 +80,34 @@ To delete a record of the extended statistics, use the following statement: ALTER TABLE table_name DROP STATS_EXTENDED stats_name; ``` +
+How it works + After you execute the statement, TiDB marks the value of the corresponding record in `mysql.stats_extended`'s column `status` to `2`, which means that the record is deleted, instead of deleting the record directly. Other TiDB nodes will read this change and delete the record in their memory cache. The background garbage collection will delete the record eventually. -### Flush the cache of one TiDB node - -It is not recommended to directly operate on the `mysql.stats_extended` system table. The direct operation on the table causes inconsistent caches on different TiDB nodes. +> **Warning:** +> +> It is **NOT RECOMMENDED** to directly operate on the `mysql.stats_extended` system table. Otherwise, inconsistent caches occur on different TiDB nodes. +> +> If you have mistakenly operated on the table, you can use the following statement on each TiDB node. Then the current cache will be cleared and the `mysql.stats_extended` table will be fully reloaded: +> +> ```sql +> ADMIN RELOAD STATS_EXTENDED; +> ``` -If you have mistakenly operated on the table, you can use the following statement on each TiDB node. Then the current cache will be cleared and the `mysql.stats_extended` table will be fully reloaded: - -```sql -ADMIN RELOAD STATS_EXTENDED; -``` + ## Export and import extended statistics -The way of exporting or importing extended statistics is the same as the regular statistics. See [Introduction to Statistics - Import and export statistics](/statistics.md#import-and-export-statistics) for details. +The way of exporting or importing extended statistics is the same as exporting or importing regular statistics. See [Introduction to Statistics - Import and export statistics](/statistics.md#import-and-export-statistics) for details. ## Usage scenarios and examples -Currently, only the correlation type is supported. This type is used to estimate the number of rows in the range query. 
The following example shows how to use the correlation type to estimate the number of rows in the range query. +There are multiple types of extended statistics, and currently, TiDB only supports the correlation type. This type is used to estimate the number of rows in the range query. The following example shows how to use the correlation type to estimate the number of rows in the range query. -The registration SQL is like the following: - -{{< copyable "sql" >}} +After setting `tidb_enable_extended_stats` to `ON`, register the extended statistics: ```sql ALTER TABLE t ADD STATS_EXTENDED s1 correlation(col1, col2); @@ -126,6 +144,7 @@ SELECT * FROM t WHERE col1 <= 1 OR col1 IS NULL; ``` The above estimation plus one will be the final estimation for the condition. This way, we don't need to use the independent assumption to get a significant estimation error. + The optimizer will use the independent assumption if the correlation factor is less than the system variable `tidb_opt_correlation_threshold`. But it will increase the estimation heuristically. The larger the system variable `tidb_opt_correlation_exp_factor` is, the larger the estimation result is. The larger the absolute value of the correlation factor is, the larger the estimation result is. From 01fa819a0203b8e689548b7c80657f304edc3747 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 8 Sep 2022 18:11:18 +0800 Subject: [PATCH 09/24] Update extended-statistics.md --- extended-statistics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extended-statistics.md b/extended-statistics.md index abd6198cdbd09..3503be8c4ab73 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -97,7 +97,7 @@ Other TiDB nodes will read this change and delete the record in their memory cac > ADMIN RELOAD STATS_EXTENDED; > ``` - +
## Export and import extended statistics From d6db83a2738568d9bf6c3d07f7eee80758e5ee34 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 13 Sep 2022 19:15:34 +0800 Subject: [PATCH 10/24] restructure the example section --- extended-statistics.md | 35 +++++++++++++++++++++++------------ 1 file changed, 23 insertions(+), 12 deletions(-) diff --git a/extended-statistics.md b/extended-statistics.md index 3503be8c4ab73..88ed04ac20870 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -105,35 +105,46 @@ The way of exporting or importing extended statistics is the same as exporting o ## Usage scenarios and examples -There are multiple types of extended statistics, and currently, TiDB only supports the correlation type. This type is used to estimate the number of rows in the range query. The following example shows how to use the correlation type to estimate the number of rows in the range query. +There are multiple types of extended statistics. Currently, TiDB only supports the correlation type. This type is used to estimate the number of rows in the range query. The following example shows how to use the correlation type to estimate the number of rows in range queries. -After setting `tidb_enable_extended_stats` to `ON`, register the extended statistics: +### Define the table + +For a table `t` defined as follows: ```sql -ALTER TABLE t ADD STATS_EXTENDED s1 correlation(col1, col2); +CREATE TABLE t(col1 INT, col2 INT, KEY(col1), KEY(col2)); ``` -When we run the `ANALYZE` after the registration, TiDB will calculate the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of the `col` and `col2` of the table `t` and write the record into the table `mysql.stats_extended`. 
+Suppose that the `col1` and `col2` of the table `t` both obey monotonically increasing constraints in row order, i.e., the values of `col1` and `col2` are strictly correlated in order (the value of the correlation is 1). -It's used to improve TiDB's index selection for the following scenario: - -For a table `t` described below: +### Make an example query {{< copyable "sql" >}} ```sql -CREATE TABLE t(col1 INT, col2 INT, KEY(col1), KEY(col2)); +SELECT * FROM t WHERE col1 > 1 ORDER BY col2 LIMIT 1; ``` -Suppose that the `col1` and `col2` of the table `t` both obey monotonically increasing constraints in row order, i.e., the values of `col1` and `col2` are strictly correlated in order (the value of the correlation is 1): +For the above query, the optimizer has two choices to access the table `t`: -{{< copyable "sql" >}} +- one uses the index on `col1` to access the table and then sorts the result by `col2` to calculate the `Top-1`. +- Another is that access the table by index on `col2` to meet the first row that satisfies `col1 > 1`. The latter's cost mainly depends on how many rows are filtered out when we scan the table in `col2`'s order. + +Usually, the optimizer can only suppose that `col1` and `col2` are independent, leading to a significant estimation error. + +### Register extended statistics + +After setting `tidb_enable_extended_stats` to `ON`, register the extended statistics: ```sql -SELECT * FROM t WHERE col1 > 1 ORDER BY col2 LIMIT 1; +ALTER TABLE t ADD STATS_EXTENDED s1 correlation(col1, col2); ``` -For the above query, the optimizer has two choices to access the table `t`: one uses the index on `col1` to access the table and then sorts the result by `col2` to calculate the `Top-1`. Another is that access the table by index on `col2` to meet the first row that satisfies `col1 > 1`. The latter's cost mainly depends on how many rows are filtered out when we scan the table in `col2`'s order. 
Usually, the optimizer can only suppose that `col1` and `col2` are independent, leading to a significant estimation error. +When we run the `ANALYZE` after the registration, TiDB will calculate the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of the `col` and `col2` of the table `t` and write the record into the table `mysql.stats_extended`. + +It's used to improve TiDB's index selection for the following scenario: + +### How extended statistics make a difference After the TiDB has the extended statistics for correlation, the optimizer can estimate how many rows we need to scan more precisely. Since the `col1` and `col2` are strictly correlated in order, the optimizer will equivalently translate the row count estimate for option two above into: From a03ad05469828cc64d4313f59872a1633349b4f5 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Wed, 14 Sep 2022 17:14:48 +0800 Subject: [PATCH 11/24] finish the restruc --- extended-statistics.md | 44 +++++++++++++++++++----------------------- 1 file changed, 20 insertions(+), 24 deletions(-) diff --git a/extended-statistics.md b/extended-statistics.md index 88ed04ac20870..6c40a8138ebcc 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -44,7 +44,7 @@ To register extended statistics, use the SQL statement `ALTER TABLE ADD STATS_EX ALTER TABLE table_name ADD STATS_EXTENDED IF NOT EXISTS stats_name stats_type(column_name, column_name...); ``` -In the statement, you can specify the table name, statistics name, statistics type, and column name of the extended statistics to be collected. +In the syntax, you can specify the table name, statistics name, statistics type, and column name of the extended statistics to be collected. - `table_name` specifies the name of the table from which the extended statistics are collected. - `stats_name` specifies the name of the statistics, which must be unique for each table. 
@@ -103,59 +103,55 @@ Other TiDB nodes will read this change and delete the record in their memory cac The way of exporting or importing extended statistics is the same as exporting or importing regular statistics. See [Introduction to Statistics - Import and export statistics](/statistics.md#import-and-export-statistics) for details. -## Usage scenarios and examples +## Usage examples -There are multiple types of extended statistics. Currently, TiDB only supports the correlation type. This type is used to estimate the number of rows in the range query. The following example shows how to use the correlation type to estimate the number of rows in range queries. +There are multiple types of extended statistics. Currently, TiDB only supports the correlation type. This type is used to estimate the number of rows in the range query and improve index selection. The following example shows how the correlation type extended statistics to estimate the number of rows in range queries. -### Define the table +### Stage 1. Define the table -For a table `t` defined as follows: +A table `t` is defined as follows: ```sql CREATE TABLE t(col1 INT, col2 INT, KEY(col1), KEY(col2)); ``` -Suppose that the `col1` and `col2` of the table `t` both obey monotonically increasing constraints in row order, i.e., the values of `col1` and `col2` are strictly correlated in order (the value of the correlation is 1). +Suppose that `col1` and `col2` of table `t` both obey monotonically increasing constraints in row order. This means that the values of `col1` and `col2` are strictly correlated in order, and the correlation factor is `1`. -### Make an example query +### Stage 2. Execute an example query without extended statistics -{{< copyable "sql" >}} +Execute the following query without using extended statistics. 
```sql SELECT * FROM t WHERE col1 > 1 ORDER BY col2 LIMIT 1; ``` -For the above query, the optimizer has two choices to access the table `t`: +For the execution of the preceding query, the TiDB optimizer has the following options to access table `t`: -- one uses the index on `col1` to access the table and then sorts the result by `col2` to calculate the `Top-1`. -- Another is that access the table by index on `col2` to meet the first row that satisfies `col1 > 1`. The latter's cost mainly depends on how many rows are filtered out when we scan the table in `col2`'s order. +- Uses the index on `col1` to access table `t` and then sorts the result by `col2` to calculate `Top-1`. +- Uses the index on `col2` to meet the first row that satisfies `col1 > 1`. The cost of this access method mainly depends on how many rows are filtered out when TiDB scans the table in `col2`'s order. -Usually, the optimizer can only suppose that `col1` and `col2` are independent, leading to a significant estimation error. +Without extended statistics, the TiDB optimizer only supposes that `col1` and `col2` are independent, which **leads to a significant estimation error**. -### Register extended statistics +### Stage 3. Enable extended statistics -After setting `tidb_enable_extended_stats` to `ON`, register the extended statistics: +Set `tidb_enable_extended_stats` to `ON`, and register the extended statistics: ```sql ALTER TABLE t ADD STATS_EXTENDED s1 correlation(col1, col2); ``` -When we run the `ANALYZE` after the registration, TiDB will calculate the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of the `col` and `col2` of the table `t` and write the record into the table `mysql.stats_extended`. 
- -It's used to improve TiDB's index selection for the following scenario: +When you execute `ANALYZE` after the registration, TiDB calculates the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of `col1` and `col2` of table `t`, and writes the record into the `mysql.stats_extended` table. -### How extended statistics make a difference +### Stage 4. See how extended statistics make a difference -After the TiDB has the extended statistics for correlation, the optimizer can estimate how many rows we need to scan more precisely. Since the `col1` and `col2` are strictly correlated in order, the optimizer will equivalently translate the row count estimate for option two above into: +After TiDB has the extended statistics for correlation, the optimizer can estimate more precisely how many rows need to be scanned. -{{< copyable "sql" >}} +At this time, for the query in [Stage 2. Execute an example query without extended statistics](#stage-2-execute-an-example-query-without-extended-statistics), `col1` and `col2` are strictly correlated in order. If TiDB accesses table `t` by using the index on `col2` to meet the first row that satisfies `col1 > 1`, the TiDB optimizer will equivalently translate the row count estimation into the following query: ```sql SELECT * FROM t WHERE col1 <= 1 OR col1 IS NULL; ``` -The above estimation plus one will be the final estimation for the condition. This way, we don't need to use the independent assumption to get a significant estimation error. - -The optimizer will use the independent assumption if the correlation factor is less than the system variable `tidb_opt_correlation_threshold`. But it will increase the estimation heuristically. The larger the system variable `tidb_opt_correlation_exp_factor` is, the larger the estimation result is. The larger the absolute value of the correlation factor is, the larger the estimation result is.
- +The preceding query result plus one will be the final estimation for the row count. In this way, you do not need to use the independent assumption and **the significant estimation error is avoided**. +If the correlation factor (`1` in this example) is less than the value of the system variable `tidb_opt_correlation_threshold`, the optimizer will use the independent assumption, but it will also increase the estimation heuristically. The larger the value of `tidb_opt_correlation_exp_factor`, the larger the estimation result. The larger the absolute value of the correlation factor, the larger the estimation result. From 7c71ba3ca34a2c3bc93f2d381045898f8abb809a Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Wed, 14 Sep 2022 17:17:41 +0800 Subject: [PATCH 12/24] Update system-variables.md Co-authored-by: Lilian Lee --- system-variables.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/system-variables.md b/system-variables.md index a0949c1b1168f..335e7c0b1cdf6 100644 --- a/system-variables.md +++ b/system-variables.md @@ -985,7 +985,7 @@ Constraint checking is always performed in place for pessimistic transactions (d - Scope: SESSION | GLOBAL - Type: Boolean - Default value: `OFF` -- This variable indicates whether TiDB can collect the extended statistic to guide the optimizer. Refer to the chapter [Introduction to Extended Statistics](./extended-statistics.md) for more information. +- This variable indicates whether TiDB can collect the extended statistic to guide the optimizer. Refer to the chapter [Introduction to Extended Statistics](/extended-statistics.md) for more information. 
### tidb_enable_fast_analyze From 7818ecaa6ce848648091fb7b80abf36317cdcc64 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Wed, 14 Sep 2022 17:18:34 +0800 Subject: [PATCH 13/24] Update extended-statistics.md --- extended-statistics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extended-statistics.md b/extended-statistics.md index 6c40a8138ebcc..66d9bf018b552 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -99,7 +99,7 @@ Other TiDB nodes will read this change and delete the record in their memory cac -## Export and import extended statistics +### Export and import extended statistics The way of exporting or importing extended statistics is the same as exporting or importing regular statistics. See [Introduction to Statistics - Import and export statistics](/statistics.md#import-and-export-statistics) for details. From f7bab0c2c2e95a2667ce1c794022737d05d94839 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Wed, 14 Sep 2022 17:20:11 +0800 Subject: [PATCH 14/24] Update system-variables.md --- system-variables.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/system-variables.md b/system-variables.md index 335e7c0b1cdf6..f88c5170e8570 100644 --- a/system-variables.md +++ b/system-variables.md @@ -985,7 +985,7 @@ Constraint checking is always performed in place for pessimistic transactions (d - Scope: SESSION | GLOBAL - Type: Boolean - Default value: `OFF` -- This variable indicates whether TiDB can collect the extended statistic to guide the optimizer. Refer to the chapter [Introduction to Extended Statistics](/extended-statistics.md) for more information. +- This variable indicates whether TiDB can collect the extended statistic to guide the optimizer. See [Introduction to Extended Statistics](/extended-statistics.md) for more information. 
### tidb_enable_fast_analyze From 3b07dea5377700ca08819f7c88f2ef67a2f4df63 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 15 Sep 2022 15:38:56 +0800 Subject: [PATCH 15/24] address comments --- extended-statistics.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/extended-statistics.md b/extended-statistics.md index 66d9bf018b552..2bd6a98cd3140 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -12,7 +12,7 @@ TiDB can collect the following two types of statistics: When the `ANALYZE` statement is executed manually or automatically, TiDB by default only collects the regular statistics and does not collect the extended statistics. This is because the extended statistics are only used for optimizer estimates in specific scenarios, and collecting them requires additional overhead. -Extended statistics are disabled by default. To collect extended statistics, you need to first enable and register the extended statistics. +Extended statistics are disabled by default. To collect extended statistics, you need to first enable the extended statistics, and then register each individual extended statistics. After the registration, the next time the `ANALYZE` statement is executed, TiDB collects both the regular statistics and the registered extended statistics. @@ -34,10 +34,12 @@ To enable extended statistics, set the system variable `tidb_enable_extended_sta SET GLOBAL tidb_enable_extended_stats = ON; ``` -The default value of this variable is `OFF`. +The default value of this variable is `OFF`. This setting is a one-time task. ### Register extended statistics +The registration is for individual extended statistics, and you need repeat the registration for each extended statistics. + To register extended statistics, use the SQL statement `ALTER TABLE ADD STATS_EXTENDED`. 
The syntax is as follows: ```sql @@ -103,9 +105,9 @@ Other TiDB nodes will read this change and delete the record in their memory cac The way of exporting or importing extended statistics is the same as exporting or importing regular statistics. See [Introduction to Statistics - Import and export statistics](/statistics.md#import-and-export-statistics) for details. -## Usage examples +## Usage examples for correlation-type extended statistics -There are multiple types of extended statistics. Currently, TiDB only supports the correlation type. This type is used to estimate the number of rows in the range query and improve index selection. The following example shows how the correlation type extended statistics to estimate the number of rows in range queries. +Currently, TiDB only supports the correlation-type extended statistics. This type is used to estimate the number of rows in the range query and improve index selection. The following example shows how the correlation-type extended statistics are used to estimate the number of rows in a range query. ### Stage 1. 
Define the table From 071d371f14d62110d2c6f9b2dcc59565c954863f Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 15 Sep 2022 18:14:17 +0800 Subject: [PATCH 16/24] refine --- TOC.md | 1 + extended-statistics.md | 6 +++++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/TOC.md b/TOC.md index ca35cf775288b..340569a591d72 100644 --- a/TOC.md +++ b/TOC.md @@ -218,6 +218,7 @@ - [Overview](/sql-physical-optimization.md) - [Index Selection](/choose-index.md) - [Statistics](/statistics.md) + - [Extended Statistics](/extended-statistics.md) - [Wrong Index Solution](/wrong-index-solution.md) - [Distinct Optimization](/agg-distinct-optimization.md) - [Cost Model](/cost-model.md) diff --git a/extended-statistics.md b/extended-statistics.md index 2bd6a98cd3140..3a825bf136643 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -10,6 +10,10 @@ TiDB can collect the following two types of statistics: - Regular statistics: statistics such as histograms and Count-Min Sketch. See [Introduction to Statistics](/statistics.md) for details. - Extended statistics: statistics filtered by tables and columns. +> **Tip:** +> +> Before reading this document, it is recommended that you read [Introduction to Statistics](/statistics.md) first. + When the `ANALYZE` statement is executed manually or automatically, TiDB by default only collects the regular statistics and does not collect the extended statistics. This is because the extended statistics are only used for optimizer estimates in specific scenarios, and collecting them requires additional overhead. Extended statistics are disabled by default. To collect extended statistics, you need to first enable the extended statistics, and then register each individual extended statistics. @@ -38,7 +42,7 @@ The default value of this variable is `OFF`. This setting is a one-time task. 
### Register extended statistics -The registration is for individual extended statistics, and you need repeat the registration for each extended statistics. +The registration for extended statistics is not a one-time task, and you need repeat the registration for each extended statistics. To register extended statistics, use the SQL statement `ALTER TABLE ADD STATS_EXTENDED`. The syntax is as follows: From 960f1edc13a35098fd9783cbb497425b081c7538 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 15 Sep 2022 19:43:32 +0800 Subject: [PATCH 17/24] add custom content --- system-variables.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/system-variables.md b/system-variables.md index 29d4bb17a4359..70465f08d4e06 100644 --- a/system-variables.md +++ b/system-variables.md @@ -1103,11 +1103,23 @@ Constraint checking is always performed in place for pessimistic transactions (d ### tidb_enable_extended_stats + + +> **Note:** +> +> This TiDB variable is not applicable to TiDB Cloud. + + + + + - Scope: SESSION | GLOBAL - Type: Boolean - Default value: `OFF` - This variable indicates whether TiDB can collect the extended statistic to guide the optimizer. See [Introduction to Extended Statistics](/extended-statistics.md) for more information. 
+ + ### tidb_enable_fast_analyze > **Warning:** From ab71359554491169b764506bee0d5097884250ff Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 20 Sep 2022 14:41:25 +0800 Subject: [PATCH 18/24] address comments --- extended-statistics.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/extended-statistics.md b/extended-statistics.md index 3a825bf136643..606a4a6037893 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -16,7 +16,7 @@ TiDB can collect the following two types of statistics: When the `ANALYZE` statement is executed manually or automatically, TiDB by default only collects the regular statistics and does not collect the extended statistics. This is because the extended statistics are only used for optimizer estimates in specific scenarios, and collecting them requires additional overhead. -Extended statistics are disabled by default. To collect extended statistics, you need to first enable the extended statistics, and then register each individual extended statistics. +Extended statistics are disabled by default. To collect extended statistics, you need to first enable the extended statistics, and then register each individual extended statistics object. After the registration, the next time the `ANALYZE` statement is executed, TiDB collects both the regular statistics and the registered extended statistics. 
@@ -25,7 +25,7 @@ After the registration, the next time the `ANALYZE` statement is executed, TiDB Extended statistics are not collected in the following scenarios: - Statistics collection on indexes only -- Statistics collection with the `ANALYZE INCREMENTAL` command +- Statistics collection using the `ANALYZE INCREMENTAL` command - Statistics collection when the value of the system variable `tidb_enable_fast_analyze` is set to `true` ## Common operations @@ -38,11 +38,11 @@ To enable extended statistics, set the system variable `tidb_enable_extended_sta SET GLOBAL tidb_enable_extended_stats = ON; ``` -The default value of this variable is `OFF`. This setting is a one-time task. +The default value of this variable is `OFF`. ### Register extended statistics -The registration for extended statistics is not a one-time task, and you need repeat the registration for each extended statistics. +The registration for extended statistics is not a one-time task, and you need to repeat the registration for each extended statistics object. To register extended statistics, use the SQL statement `ALTER TABLE ADD STATS_EXTENDED`. The syntax is as follows: @@ -53,14 +53,14 @@ ALTER TABLE table_name ADD STATS_EXTENDED IF NOT EXISTS stats_name stats_type(co In the syntax, you can specify the table name, statistics name, statistics type, and column name of the extended statistics to be collected. - `table_name` specifies the name of the table from which the extended statistics are collected. -- `stats_name` specifies the name of the statistics, which must be unique for each table. +- `stats_name` specifies the name of the statistics object, which must be unique for each table. - `stats_type` specifies the type of the statistics. Currently, only the correlation type is supported. - `column_name` specifies the column group, which might have multiple columns. Currently, you can only specify two column names.
How it works -To improve access performance, each TiDB node maintains a cache in the system table `mysql.stats_extended` for extended statistics. After you register the extended statistics, the next time the `ANALYZE` statement is executed, TiDB will collect the extended statistics if the system table `mysql.stats_extended` has the corresponding records. +To improve access performance, each TiDB node maintains a cache in the system table `mysql.stats_extended` for extended statistics. After you register the extended statistics, the next time the `ANALYZE` statement is executed, TiDB will collect the extended statistics if the system table `mysql.stats_extended` has the corresponding objects. Each row in the `mysql.stats_extended` table has a `version` column. Once a row is updated, the value of `version` is increased. In this way, TiDB loads the table into memory incrementally, instead of fully. @@ -70,7 +70,7 @@ TiDB loads `mysql.stats_extended` periodically to ensure that the cache is kept > > It is **NOT RECOMMENDED** to directly operate on the `mysql.stats_extended` system table. Otherwise, inconsistent caches occur on different TiDB nodes. > -> If you have mistakenly operated on the table, you can use the following statement on each TiDB node. Then the current cache will be cleared and the `mysql.stats_extended` table will be fully reloaded: +> If you have mistakenly operated on the table, you can execute the following statement on each TiDB node. 
Then the current cache will be cleared and the `mysql.stats_extended` table will be fully reloaded: > > ```sql > ADMIN RELOAD STATS_EXTENDED; @@ -80,7 +80,7 @@ TiDB loads `mysql.stats_extended` periodically to ensure that the cache is kept ### Delete extended statistics -To delete a record of the extended statistics, use the following statement: +To delete an extended statistics object, use the following statement: ```sql ALTER TABLE table_name DROP STATS_EXTENDED stats_name; @@ -89,9 +89,9 @@ ALTER TABLE table_name DROP STATS_EXTENDED stats_name;
How it works -After you execute the statement, TiDB marks the value of the corresponding record in `mysql.stats_extended`'s column `status` to `2`, which means that the record is deleted, instead of deleting the record directly. +After you execute the statement, TiDB marks the value of the corresponding object in `mysql.stats_extended`'s column `status` to `2`, instead of deleting the object directly. -Other TiDB nodes will read this change and delete the record in their memory cache. The background garbage collection will delete the record eventually. +Other TiDB nodes will read this change and delete the object in their memory cache. The background garbage collection will delete the object eventually. > **Warning:** > @@ -140,13 +140,13 @@ Without extended statistics, the TiDB optimizer only supposes that `col1` and `c ### Stage 3. Enable extended statistics -Set `tidb_enable_extended_stats` to `ON`, and register the extended statistics: +Set `tidb_enable_extended_stats` to `ON`, and register the extended statistics object for `col1` and `col2`: ```sql ALTER TABLE t ADD STATS_EXTENDED s1 correlation(col1, col2); ``` -When you execute `ANALYZE` after the registration, TiDB calculates the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of `col` and `col2` of table `t`, and write the record into the `mysql.stats_extended` table. +When you execute `ANALYZE` after the registration, TiDB calculates the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of `col` and `col2` of table `t`, and writes the object into the `mysql.stats_extended` table. ### Stage 4. 
See how extended statistics make a difference From 36ba3628295c7b47d3aa89d4d4639b8258fb1782 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 20 Sep 2022 15:38:37 +0800 Subject: [PATCH 19/24] add to index --- basic-features.md | 2 +- experimental-features.md | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/basic-features.md b/basic-features.md index 0d85f97fbfbd8..5c6349b17f2c8 100644 --- a/basic-features.md +++ b/basic-features.md @@ -126,7 +126,7 @@ This document lists the features supported in each TiDB version. Note that suppo | ------------------------------------------------------------ | :--: | :--: | :--: | ------------ | :----------: | :----------: | :----------: | :----------: | :----------: | | [CMSketch](/statistics.md) | Disabled by default | Disabled by default | Disabled by default | Disabled by default | Disabled by default | Y | Y | Y | Y | | [Histograms](/statistics.md) | Y | Y | Y | Y | Y | Y | Y | Y | Y | -| Extended statistics (multiple columns) | Experimental | Experimental | Experimental| Experimental | Experimental | Experimental | Experimental | Experimental | N | +| [Extended statistics](/extended-statistics.md) | Experimental | Experimental | Experimental| Experimental | Experimental | Experimental | Experimental | Experimental | N | | [Statistics feedback](/statistics.md#automatic-update) | Deprecated | Deprecated | Deprecated | Deprecated | Experimental | Experimental | Experimental | Experimental | Experimental | | [Automatically update statistics](/statistics.md#automatic-update) | Y | Y | Y | Y | Y | Y | Y | Y | Y | | [Fast Analyze](/system-variables.md#tidb_enable_fast_analyze) | Experimental| Experimental | Experimental | Experimental | Experimental | Experimental | Experimental | Experimental | Experimental | diff --git a/experimental-features.md b/experimental-features.md index 4d0b214208288..2f4abbe25c2bf 100644 --- a/experimental-features.md +++ 
b/experimental-features.md @@ -16,6 +16,7 @@ This document introduces the experimental features of TiDB in different versions + [Use the thread pool to handle read requests from the storage engine](/tiflash/tiflash-configuration.md#configure-the-tiflashtoml-file). (Introduced in v6.2.0) + [Cost Model Version 2](/cost-model.md#cost-model-version-2). (Introduced in v6.2.0) + [FastScan](/develop/dev-guide-use-fastscan.md). (Introduced in v6.2.0) ++ [Extended statistics](/extended-statistics.md). (Introduced in v5.0.0) + [Randomly sample about 10000 rows of data to quickly build statistics](/system-variables.md#tidb_enable_fast_analyze) (Introduced in v3.0) ## Stability From e162e72562a1914f8758ee4a3050e11a3b9e99f8 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 22 Sep 2022 11:29:25 +0800 Subject: [PATCH 20/24] Apply suggestions from code review Co-authored-by: Lilian Lee --- extended-statistics.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/extended-statistics.md b/extended-statistics.md index 606a4a6037893..16416066e59f0 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -38,7 +38,7 @@ To enable extended statistics, set the system variable `tidb_enable_extended_sta SET GLOBAL tidb_enable_extended_stats = ON; ``` -The default value of this variable is `OFF`. +The default value of this variable is `OFF`. The setting of this system variable applies to all extended statistics objects. ### Register extended statistics @@ -115,7 +115,7 @@ Currently, TiDB only supports the correlation-type extended statistics. This typ ### Stage 1. Define the table -A table `t` is defined as follows: +Define a table `t` as follows: ```sql CREATE TABLE t(col1 INT, col2 INT, KEY(col1), KEY(col2)); @@ -125,7 +125,7 @@ Suppose that `col1` and `col2` of table `t` both obey monotonically increasing c ### Stage 2. 
Execute an example query without extended statistics -Execute the following query without using extended statistics. +Execute the following query without using extended statistics: ```sql SELECT * FROM t WHERE col1 > 1 ORDER BY col2 LIMIT 1; From 0eb14b04218d98aefdc107b5244e17a591a436cc Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 22 Sep 2022 11:31:19 +0800 Subject: [PATCH 21/24] Apply suggestions from code review Co-authored-by: Lilian Lee --- extended-statistics.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/extended-statistics.md b/extended-statistics.md index 16416066e59f0..e2e782f6cc465 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -113,7 +113,7 @@ The way of exporting or importing extended statistics is the same as exporting o Currently, TiDB only supports the correlation-type extended statistics. This type is used to estimate the number of rows in the range query and improve index selection. The following example shows how the correlation-type extended statistics are used to estimate the number of rows in a range query. -### Stage 1. Define the table +### Step 1. Define the table Define a table `t` as follows: @@ -123,7 +123,7 @@ CREATE TABLE t(col1 INT, col2 INT, KEY(col1), KEY(col2)); Suppose that `col1` and `col2` of table `t` both obey monotonically increasing constraints in row order. This means that the values of `col1` and `col2` are strictly correlated in order, and the correlation factor is `1`. -### Stage 2. Execute an example query without extended statistics +### Step 2. Execute an example query without extended statistics Execute the following query without using extended statistics: @@ -138,7 +138,7 @@ For the execution of the preceding query, the TiDB optimizer has the following o Without extended statistics, the TiDB optimizer only supposes that `col1` and `col2` are independent, which **leads to a significant estimation error**. 
-### Stage 3. Enable extended statistics +### Step 3. Enable extended statistics Set `tidb_enable_extended_stats` to `ON`, and register the extended statistics object for `col1` and `col2`: @@ -148,7 +148,7 @@ ALTER TABLE t ADD STATS_EXTENDED s1 correlation(col1, col2); When you execute `ANALYZE` after the registration, TiDB calculates the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) of `col` and `col2` of table `t`, and writes the object into the `mysql.stats_extended` table. -### Stage 4. See how extended statistics make a difference +### Step 4. See how extended statistics make a difference After TiDB has the extended statistics for correlation, the optimizer can estimate how many rows to be scanned more precisely. From 9622bd3ce5938dffcdc20a620713112c6138f7c0 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 22 Sep 2022 13:00:52 +0800 Subject: [PATCH 22/24] Update extended-statistics.md --- extended-statistics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extended-statistics.md b/extended-statistics.md index e2e782f6cc465..614ac02ce1e33 100644 --- a/extended-statistics.md +++ b/extended-statistics.md @@ -152,7 +152,7 @@ When you execute `ANALYZE` after the registration, TiDB calculates the [Pearson After TiDB has the extended statistics for correlation, the optimizer can estimate how many rows to be scanned more precisely. -At this time, for the query in [Stage 2. Execute an example query without extended statistics](#stage-2-execute-an-example-query-without-extended-statistics), `col1` and `col2` are strictly correlated in order. If TiDB accesses table `t` by using the index on `col2` to meet the first row that satisfies `col1 > 1`, the TiDB optimizer will equivalently translate the row count estimation into the following query: +At this time, for the query in [Step 2.
Execute an example query without extended statistics](#step-2-execute-an-example-query-without-extended-statistics), `col1` and `col2` are strictly correlated in order. If TiDB accesses table `t` by using the index on `col2` to meet the first row that satisfies `col1 > 1`, the TiDB optimizer will equivalently translate the row count estimation into the following query: ```sql SELECT * FROM t WHERE col1 <= 1 OR col1 IS NULL; From 3b6bf167c461ed811655777094aeb4a8bb9a5535 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 22 Sep 2022 14:18:15 +0800 Subject: [PATCH 23/24] Update system-variables.md --- system-variables.md | 1 + 1 file changed, 1 insertion(+) diff --git a/system-variables.md b/system-variables.md index d01b8bfa26f38..80be2d3b8f34e 100644 --- a/system-variables.md +++ b/system-variables.md @@ -1177,6 +1177,7 @@ MPP is a distributed computing framework provided by the TiFlash engine, which a - Scope: SESSION | GLOBAL +- Persists to cluster: Yes - Type: Boolean - Default value: `OFF` - This variable indicates whether TiDB can collect the extended statistic to guide the optimizer. See [Introduction to Extended Statistics](/extended-statistics.md) for more information. From 820992a3c8f3fe578ccf2bfb8e8263300eaefb09 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Thu, 22 Sep 2022 17:59:29 +0800 Subject: [PATCH 24/24] add tidb_optimizer_selectivity_level --- system-variables.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/system-variables.md b/system-variables.md index 80be2d3b8f34e..d9195ded281d0 100644 --- a/system-variables.md +++ b/system-variables.md @@ -2520,6 +2520,14 @@ explain select * from t where age=5; - Default value: `OFF` - This variable is used to control whether common table expressions (CTEs) in the entire session are inlined or not. The default value is `OFF`, which means that inlining CTE is not enforced by default. 
However, you can still inline CTE by specifying the `MERGE()` hint. If the variable is set to `ON`, all CTEs (except recursive CTE) in this session are forced to be inlined. +### tidb_optimizer_selectivity_level + +- Scope: SESSION | GLOBAL +- Persists to cluster: Yes +- Default value: `1` +- Value options: `1` and `2` (not recommended) +- This variable controls the iteration of the optimizer's estimation logic. After changing the value of this variable, the estimation logic of the optimizer will change greatly. Currently, `1` is the only valid value. It is not recommended to set the value to `2`. + ### tidb_partition_prune_mode New in v5.1 - Scope: SESSION | GLOBAL