From e7d840fbef53819cfeffdfcf639ab086f79e1c2e Mon Sep 17 00:00:00 2001 From: leiysky Date: Thu, 21 May 2020 13:30:29 +0800 Subject: [PATCH 1/5] move troubleshoot part to new doc --- tiflash/maintain-tiflash.md | 64 ------------------------------ tiflash/troubleshoot-tiflash.md | 70 +++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+), 64 deletions(-) create mode 100644 tiflash/troubleshoot-tiflash.md diff --git a/tiflash/maintain-tiflash.md b/tiflash/maintain-tiflash.md index 0eb27af2fd56b..23579431415bb 100644 --- a/tiflash/maintain-tiflash.md +++ b/tiflash/maintain-tiflash.md @@ -102,70 +102,6 @@ To manually delete the replication rules in PD, take the following steps: curl -v -X DELETE http://:/pd/api/v1/config/rule/tiflash/table-45-r ``` -## TiFlash troubleshooting - -This section describes some commonly encountered issues when using TiFlash, the reasons, and the solutions. - -### TiFlash replica is always unavailable - -This is because TiFlash is in an abnormal state caused by configuration errors or environment issues. Take the following steps to identify the faulty component: - -1. Check whether PD enables the `Placement Rules` feature (to enable the feature, see the step 2 of [Add TiFlash component to an existing TiDB cluster](/tiflash/deploy-tiflash.md#add-tiflash-component-to-an-existing-tidb-cluster): - - {{< copyable "shell-regular" >}} - - ```shell - echo 'config show replication' | /path/to/pd-ctl -u http://: - ``` - - The expected result is `"enable-placement-rules": "true"`. - -2. Check whether the TiFlash process is working correctly by viewing `UpTime` on the TiFlash-Summary monitoring panel. - -3. Check whether the TiFlash proxy status is normal through `pd-ctl`. - - {{< copyable "shell-regular" >}} - - ```shell - echo "store" | /path/to/pd-ctl -u http://: - ``` - - The TiFlash proxy's `store.labels` includes information such as `{"key": "engine", "value": "tiflash"}`. You can check this information to confirm a TiFlash proxy. 
- -4. Check whether `pd buddy` can correctly print the logs (the log path is the value of `log` in the [flash.flash_cluster] configuration item; the default log path is under the `tmp` directory configured in the TiFlash configuration file). - -5. Check whether the value of `max-replicas` in PD is less than or equal to the number of TiKV nodes in the cluster. If not, PD cannot replicate data to TiFlash: - - {{< copyable "shell-regular" >}} - - ```shell - echo 'config show replication' | /path/to/pd-ctl -u http://: - ``` - - Reconfirm the value of `max-replicas`. - -6. Check whether the remaining disk space of the machine (where `store` of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the `store` capacity (which is controlled by the `low-space-ratio` parameter), PD cannot schedule data to this TiFlash node. - -### TiFlash query time is unstable, and the error log prints many `Lock Exception` messages - -This is because large amounts of data are written to the cluster, which causes that the TiFlash query encounters a lock and requires query retry. - -You can set the query timestamp to one second earlier in TiDB. For example, if the current time is '2020-04-08 20:15:01', you can execute `set @@tidb_snapshot='2020-04-08 20:15:00';` before you execute the query. This makes less TiFlash queries encounter a lock and mitigates the risk of unstable query time. - -### Some queries return the `Region Unavailable` error - -If the load pressure on TiFlash is too heavy and it causes that TiFlash data replication falls behind, some queries might return the `Region Unavailable` error. - -In this case, you can balance the load pressure by adding more TiFlash nodes. - -### Data file corruption - -Take the following steps to handle the data file corruption: - -1. Refer to [Take a TiFlash node down](#take-a-tiflash-node-down) to take the corresponding TiFlash node down. -2. Delete the related data of the TiFlash node. -3. 
Redeploy the TiFlash node in the cluster. - ## TiFlash critical logs | Log Information | Log Description | diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md new file mode 100644 index 0000000000000..ec7bd8fa7e02b --- /dev/null +++ b/tiflash/troubleshoot-tiflash.md @@ -0,0 +1,70 @@ +--- +title: Troubleshoot a TiFlash cluseter +summary: Learn common operations when you troubleshoot a TiFlash cluster. +category: reference +aliases: ['/docs/dev/reference/tiflash/troubleshoot/'] +--- + +## TiFlash troubleshooting + +This section describes some commonly encountered issues when using TiFlash, the reasons, and the solutions. + +### TiFlash replica is always unavailable + +This is because TiFlash is in an abnormal state caused by configuration errors or environment issues. Take the following steps to identify the faulty component: + +1. Check whether PD enables the `Placement Rules` feature (to enable the feature, see the step 2 of [Add TiFlash component to an existing TiDB cluster](/tiflash/deploy-tiflash.md#add-tiflash-component-to-an-existing-tidb-cluster): + + {{< copyable "shell-regular" >}} + + ```shell + echo 'config show replication' | /path/to/pd-ctl -u http://: + ``` + + The expected result is `"enable-placement-rules": "true"`. + +2. Check whether the TiFlash process is working correctly by viewing `UpTime` on the TiFlash-Summary monitoring panel. + +3. Check whether the TiFlash proxy status is normal through `pd-ctl`. + + {{< copyable "shell-regular" >}} + + ```shell + echo "store" | /path/to/pd-ctl -u http://: + ``` + + The TiFlash proxy's `store.labels` includes information such as `{"key": "engine", "value": "tiflash"}`. You can check this information to confirm a TiFlash proxy. + +4. Check whether `pd buddy` can correctly print the logs (the log path is the value of `log` in the [flash.flash_cluster] configuration item; the default log path is under the `tmp` directory configured in the TiFlash configuration file). + +5. 
Check whether the value of `max-replicas` in PD is less than or equal to the number of TiKV nodes in the cluster. If not, PD cannot replicate data to TiFlash: + + {{< copyable "shell-regular" >}} + + ```shell + echo 'config show replication' | /path/to/pd-ctl -u http://: + ``` + + Reconfirm the value of `max-replicas`. + +6. Check whether the remaining disk space of the machine (where `store` of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the `store` capacity (which is controlled by the `low-space-ratio` parameter), PD cannot schedule data to this TiFlash node. + +### TiFlash query time is unstable, and the error log prints many `Lock Exception` messages + +This is because large amounts of data are written to the cluster, which causes that the TiFlash query encounters a lock and requires query retry. + +You can set the query timestamp to one second earlier in TiDB. For example, if the current time is '2020-04-08 20:15:01', you can execute `set @@tidb_snapshot='2020-04-08 20:15:00';` before you execute the query. This makes less TiFlash queries encounter a lock and mitigates the risk of unstable query time. + +### Some queries return the `Region Unavailable` error + +If the load pressure on TiFlash is too heavy and it causes that TiFlash data replication falls behind, some queries might return the `Region Unavailable` error. + +In this case, you can balance the load pressure by adding more TiFlash nodes. + +### Data file corruption + +Take the following steps to handle the data file corruption: + +1. Refer to [Take a TiFlash node down](#take-a-tiflash-node-down) to take the corresponding TiFlash node down. +2. Delete the related data of the TiFlash node. +3. Redeploy the TiFlash node in the cluster. 
From 81decc0e813a18ffa6c1976efc68599529695b05 Mon Sep 17 00:00:00 2001 From: leiysky Date: Thu, 21 May 2020 15:26:42 +0800 Subject: [PATCH 2/5] fix head --- tiflash/troubleshoot-tiflash.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md index ec7bd8fa7e02b..f46e1ddfe0bb4 100644 --- a/tiflash/troubleshoot-tiflash.md +++ b/tiflash/troubleshoot-tiflash.md @@ -1,15 +1,15 @@ --- -title: Troubleshoot a TiFlash cluseter +title: Troubleshoot a TiFlash Cluster summary: Learn common operations when you troubleshoot a TiFlash cluster. category: reference aliases: ['/docs/dev/reference/tiflash/troubleshoot/'] --- -## TiFlash troubleshooting +# Troubleshoot a TiFlash Cluster This section describes some commonly encountered issues when using TiFlash, the reasons, and the solutions. -### TiFlash replica is always unavailable +## TiFlash replica is always unavailable This is because TiFlash is in an abnormal state caused by configuration errors or environment issues. Take the following steps to identify the faulty component: @@ -49,19 +49,19 @@ This is because TiFlash is in an abnormal state caused by configuration errors o 6. Check whether the remaining disk space of the machine (where `store` of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the `store` capacity (which is controlled by the `low-space-ratio` parameter), PD cannot schedule data to this TiFlash node. -### TiFlash query time is unstable, and the error log prints many `Lock Exception` messages +## TiFlash query time is unstable, and the error log prints many `Lock Exception` messages This is because large amounts of data are written to the cluster, which causes that the TiFlash query encounters a lock and requires query retry. You can set the query timestamp to one second earlier in TiDB. 
For example, if the current time is '2020-04-08 20:15:01', you can execute `set @@tidb_snapshot='2020-04-08 20:15:00';` before you execute the query. This makes less TiFlash queries encounter a lock and mitigates the risk of unstable query time. -### Some queries return the `Region Unavailable` error +## Some queries return the `Region Unavailable` error If the load pressure on TiFlash is too heavy and it causes that TiFlash data replication falls behind, some queries might return the `Region Unavailable` error. In this case, you can balance the load pressure by adding more TiFlash nodes. -### Data file corruption +## Data file corruption Take the following steps to handle the data file corruption: From f414b680f76d3b0033ba024be9925ffa414841f1 Mon Sep 17 00:00:00 2001 From: leiysky Date: Thu, 21 May 2020 16:27:40 +0800 Subject: [PATCH 3/5] remove aliases --- tiflash/troubleshoot-tiflash.md | 1 - 1 file changed, 1 deletion(-) diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md index f46e1ddfe0bb4..c246d5d09c747 100644 --- a/tiflash/troubleshoot-tiflash.md +++ b/tiflash/troubleshoot-tiflash.md @@ -2,7 +2,6 @@ title: Troubleshoot a TiFlash Cluster summary: Learn common operations when you troubleshoot a TiFlash cluster. 
category: reference -aliases: ['/docs/dev/reference/tiflash/troubleshoot/'] --- # Troubleshoot a TiFlash Cluster From bcb5d1d8b3b0a2aefe80fe5cf9229152d01aa27b Mon Sep 17 00:00:00 2001 From: leiysky Date: Thu, 21 May 2020 17:56:25 +0800 Subject: [PATCH 4/5] add document to TOC --- TOC.md | 1 + 1 file changed, 1 insertion(+) diff --git a/TOC.md b/TOC.md index 9498a94de02ab..186773c6ad005 100644 --- a/TOC.md +++ b/TOC.md @@ -335,6 +335,7 @@ - [Configure TiFlash](/tiflash/tiflash-configuration.md) - [TiFlash Alert Rules](/tiflash/tiflash-alert-rules.md) - [Tune TiFlash Performance](/tiflash/tune-tiflash-performance.md) + - [Troubleshoot a TiFlash Cluster](/tiflash/troubleshoot-tiflash.md) - [FAQ](/tiflash/tiflash-faq.md) + TiDB Binlog - [Overview](/tidb-binlog/tidb-binlog-overview.md) From b03162135d037deededd8c3fcbc0bf8e69d5099e Mon Sep 17 00:00:00 2001 From: leiysky Date: Thu, 21 May 2020 17:56:37 +0800 Subject: [PATCH 5/5] fix --- tiflash/maintain-tiflash.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tiflash/maintain-tiflash.md b/tiflash/maintain-tiflash.md index 23579431415bb..56dbc77524497 100644 --- a/tiflash/maintain-tiflash.md +++ b/tiflash/maintain-tiflash.md @@ -7,7 +7,7 @@ aliases: ['/docs/dev/reference/tiflash/maintain/'] # Maintain a TiFlash Cluster -This document describes how to perform common operations when you maintain a TiFlash cluster, including checking the TiFlash version, taking TiFlash nodes down, and troubleshooting TiFlash. This document also introduces critical logs and a system table of TiFlash. +This document describes how to perform common operations when you maintain a TiFlash cluster, including checking the TiFlash version and taking TiFlash nodes down. This document also introduces critical logs and a system table of TiFlash. ## Check the TiFlash version