From a41041b42a627eed2cb1a43df4e985b2721c2026 Mon Sep 17 00:00:00 2001
From: toutdesuite
Date: Wed, 8 Apr 2020 09:50:07 +0800
Subject: [PATCH 1/3] add the alert-rules.md

---
 TOC.md                           |  1 +
 reference/tiflash/alert-rules.md | 69 ++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)
 create mode 100644 reference/tiflash/alert-rules.md

diff --git a/TOC.md b/TOC.md
index 4d3b48547e1d1..e5f0c9f92243f 100644
--- a/TOC.md
+++ b/TOC.md
@@ -307,6 +307,7 @@
     - [Overview](/reference/tiflash/overview.md)
     - [Deploy a TiFlash Cluster](/reference/tiflash/deploy.md)
     - [Use TiFlash](/reference/tiflash/use-tiflash.md)
+    - [TiFlash Alert Rules](/reference/tiflash/alert-rules.md)
   + TiDB Binlog
     - [Overview](/reference/tidb-binlog/overview.md)
     - [Deploy](/reference/tidb-binlog/deploy.md)
diff --git a/reference/tiflash/alert-rules.md b/reference/tiflash/alert-rules.md
new file mode 100644
index 0000000000000..78723b437c2f0
--- /dev/null
+++ b/reference/tiflash/alert-rules.md
@@ -0,0 +1,69 @@
+---
+title: TiFlash Alert Rules
+summary: Learn the alert rules of the TiFlash cluster.
+category: reference
+---
+
+# TiFlash Alert Rules
+
+This documents introduces the alert rules of the TiFlash cluster.
+
+## `TiFlash_schema_error`
+
+- Alert rule:
+
+    `increase(tiflash_schema_apply_count{type="failed"}[15m]) > 0`
+
+- Rule description:
+
+    You can get an alert when the schema apply error occurs.
+
+- How to handle:
+
+    The error might be caused by the logical problem. Get in touch with the TiFlash R&D.
+
+## `TiFlash_schema_apply_duration`
+
+- Alert rule:
+
+    `histogram_quantile(0.99, sum(rate(tiflash_schema_apply_duration_seconds_bucket[1m])) BY (le, instance)) > 20`
+
+- Rule description:
+
+    You can get an alert when the probability that the apply duration exceeds 20 seconds is over 99%.
+
+- How to handle:
+
+    It might be caused by the internal problems of the TiFlash TMT engine. Get in touch with the TiFlash R&D.
+
+## `TiFlash_raft_read_index_duration`
+
+- Alert rule:
+
+    `histogram_quantile(0.99, sum(rate(tiflash_raft_read_index_duration_seconds_bucket[1m])) BY (le, instance)) > 3`
+
+- Rule description:
+
+    You can get an alert when the probability that the read index duration exceeds 3 seconds is over 99%.
+
+    > **Note:**
+    >
+    > `read index` is the kvproto request sent to the TiKV leader. TiKV region retries, busy Store, or network problems might lead to long request time of read index.
+
+- How to handle:
+
+    The frequent retries might be caused by frequent TiKV cluster split events or frequent TiKV cluster migrations. You can check the TiKV cluster status to identify the retry reason.
+
+## `TiFlash_raft_wait_index_duration`
+
+- Alert rule:
+
+    `histogram_quantile(0.99, sum(rate(tiflash_raft_wait_index_duration_seconds_bucket[1m])) BY (le, instance)) > 2`
+
+- Rule description:
+
+    You can get an alert when the probability that the wait time for Region Raft Index in TiFlash exceeds 2 seconds is over 99%.
+
+- How to handle:
+
+    It might be caused by communications problems between TiKV and Proxy. Get in touch with the TiFlash R&D.

From 23ad083bbe94ad303876c57ceb5fe27dfa3c37ce Mon Sep 17 00:00:00 2001
From: toutdesuite
Date: Fri, 10 Apr 2020 11:12:17 +0800
Subject: [PATCH 2/3] address comments

---
 reference/tiflash/alert-rules.md | 36 ++++++++++++++++----------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/reference/tiflash/alert-rules.md b/reference/tiflash/alert-rules.md
index 78723b437c2f0..d041f25e6e7b5 100644
--- a/reference/tiflash/alert-rules.md
+++ b/reference/tiflash/alert-rules.md
@@ -6,7 +6,7 @@ category: reference

 # TiFlash Alert Rules

-This documents introduces the alert rules of the TiFlash cluster.
+This document introduces the alert rules of the TiFlash cluster.

 ## `TiFlash_schema_error`

@@ -14,13 +14,13 @@ This documents introduces the alert rules of the TiFlash cluster.

     `increase(tiflash_schema_apply_count{type="failed"}[15m]) > 0`

-- Rule description:
+- Description:

-    You can get an alert when the schema apply error occurs.
+    When the schema apply error occurs, an alert is triggered.

-- How to handle:
+- Solution:

-    The error might be caused by the logical problem. Get in touch with the TiFlash R&D.
+    The error might be caused by some wrong logic. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.

 ## `TiFlash_schema_apply_duration`

@@ -28,13 +28,13 @@ This documents introduces the alert rules of the TiFlash cluster.

     `histogram_quantile(0.99, sum(rate(tiflash_schema_apply_duration_seconds_bucket[1m])) BY (le, instance)) > 20`

-- Rule description:
+- Description:

-    You can get an alert when the probability that the apply duration exceeds 20 seconds is over 99%.
+    When the probability that the apply duration exceeds 20 seconds is over 99%, an alert is triggered.

-- How to handle:
+- Solution:

-    It might be caused by the internal problems of the TiFlash TMT engine. Get in touch with the TiFlash R&D.
+    It might be caused by the internal problems of the TiFlash TMT engine. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.

 ## `TiFlash_raft_read_index_duration`

@@ -42,17 +42,17 @@ This documents introduces the alert rules of the TiFlash cluster.

     `histogram_quantile(0.99, sum(rate(tiflash_raft_read_index_duration_seconds_bucket[1m])) BY (le, instance)) > 3`

-- Rule description:
+- Description:

-    You can get an alert when the probability that the read index duration exceeds 3 seconds is over 99%.
+    When the probability that the read index duration exceeds 3 seconds is over 99%, an alert is triggered.

     > **Note:**
     >
-    > `read index` is the kvproto request sent to the TiKV leader. TiKV region retries, busy Store, or network problems might lead to long request time of read index.
+    > `read index` is the kvproto request sent to the TiKV leader. TiKV region retries, busy store, or network problems might lead to long request time of `read index`.

-- How to handle:
+- Solution:

-    The frequent retries might be caused by frequent TiKV cluster split events or frequent TiKV cluster migrations. You can check the TiKV cluster status to identify the retry reason.
+    The frequent retries might be caused by frequent splitting or migration of the TiKV cluster. You can check the TiKV cluster status to identify the retry reason.

 ## `TiFlash_raft_wait_index_duration`

@@ -60,10 +60,10 @@ This documents introduces the alert rules of the TiFlash cluster.

     `histogram_quantile(0.99, sum(rate(tiflash_raft_wait_index_duration_seconds_bucket[1m])) BY (le, instance)) > 2`

-- Rule description:
+- Description:

-    You can get an alert when the probability that the wait time for Region Raft Index in TiFlash exceeds 2 seconds is over 99%.
+    When the probability that the wait time for Region Raft Index in TiFlash exceeds 2 seconds is over 99%, an alert is triggered.

-- How to handle:
+- Solution:

-    It might be caused by communications problems between TiKV and Proxy. Get in touch with the TiFlash R&D.
+    It might be caused by a communications error between TiKV and the proxy. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.

From 812fe66aabaa08e16d62be8bc981499dd7fb838b Mon Sep 17 00:00:00 2001
From: Keke Yi <40977455+yikeke@users.noreply.github.com>
Date: Fri, 10 Apr 2020 15:57:57 +0800
Subject: [PATCH 3/3] Apply suggestions from code review

---
 reference/tiflash/alert-rules.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/reference/tiflash/alert-rules.md b/reference/tiflash/alert-rules.md
index d041f25e6e7b5..e16f16c9bbb63 100644
--- a/reference/tiflash/alert-rules.md
+++ b/reference/tiflash/alert-rules.md
@@ -62,8 +62,8 @@ This document introduces the alert rules of the TiFlash cluster.

 - Description:

-    When the probability that the wait time for Region Raft Index in TiFlash exceeds 2 seconds is over 99%, an alert is triggered.
+    When the probability that the waiting time for Region Raft Index in TiFlash exceeds 2 seconds is over 99%, an alert is triggered.

 - Solution:

-    It might be caused by a communications error between TiKV and the proxy. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
+    It might be caused by a communication error between TiKV and the proxy. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
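
Note on the three duration rules in this series: an expression of the form `histogram_quantile(0.99, ...) > 20` compares the estimated 99th percentile with the threshold, so it fires when roughly more than 1% of recent observations are slower than 20 seconds. The Python sketch below is only an illustration of that reading. It imitates, in simplified form, the linear interpolation Prometheus applies to classic histogram buckets; the function is a stand-in written for this note, and the bucket bounds and counts are made-up values, not the buckets TiFlash actually exports.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified stand-in for the PromQL function of the same name.
    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls into the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                # Degenerate bucket with no observations of its own.
                return upper_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


INF = math.inf

# Hypothetical cumulative bucket counts (le in seconds), assumed for illustration.
healthy = [(1, 900), (5, 980), (10, 995), (20, 999), (40, 1000), (INF, 1000)]
slow = [(1, 500), (5, 700), (10, 850), (20, 950), (40, 1000), (INF, 1000)]

THRESHOLD_SECONDS = 20  # threshold used by TiFlash_schema_apply_duration

p99_healthy = histogram_quantile(0.99, healthy)
p99_slow = histogram_quantile(0.99, slow)

print(f"healthy p99 = {p99_healthy:.2f}s, fires: {p99_healthy > THRESHOLD_SECONDS}")
print(f"slow    p99 = {p99_slow:.2f}s, fires: {p99_slow > THRESHOLD_SECONDS}")
```

In the `slow` data set, 5% of the observations land above the 20-second bucket, which pushes the estimated p99 past the threshold. In the `healthy` data set only 0.1% do, so the p99 stays well below it.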