From c23a790b824997f4f239fdf8d638dbccfe60ba90 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Fri, 8 Oct 2021 14:49:58 +0800
Subject: [PATCH 1/4] tiflash, metric: add alert for TiFlash down

---
 alert-rules.md                 | 21 ++++++++++++++++++++-
 tiflash/tiflash-alert-rules.md |  2 +-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/alert-rules.md b/alert-rules.md
index 1ed724b2cf2d1..e4b4ac720340e 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -8,7 +8,7 @@ aliases: ['/docs/dev/alert-rules/','/docs/dev/reference/alert-rules/']
 
 # TiDB Cluster Alert Rules
 
-This document describes the alert rules for different components in a TiDB cluster, including the rule descriptions and solutions of the alert items in TiDB, TiKV, PD, TiDB Binlog, Node_exporter and Blackbox_exporter.
+This document describes the alert rules for different components in a TiDB cluster, including the rule descriptions and solutions of the alert items in TiDB, TiKV, PD, TiFlash, TiDB Binlog, Node_exporter and Blackbox_exporter.
 
 According to the severity level, alert rules are divided into three categories (from high to low): emergency-level, critical-level, and warning-level. This division of severity levels applies to all alert items of each component below.
 
@@ -781,6 +781,10 @@ This section gives the alert rules for the TiKV component.
 
     The speed of splitting Regions is slower than the write speed. To alleviate this issue, you’d better update TiDB to a version that supports batch-split (>= 2.1.0-rc1). If it is not possible to update temporarily, you can use `pd-ctl operator add split-region --policy=approximate` to manually split Regions.
 
+## TiFlash alert rules
+
+For the detailed descriptions of TiFlash alert rules, see [TiFlash Alert Rules](tiflash\tiflash-alert-rules.md).
+
 ## TiDB Binlog alert rules
 
 For the detailed descriptions of TiDB Binlog alert rules, see [TiDB Binlog monitoring document](/tidb-binlog/monitor-tidb-binlog-cluster.md#alert-rules).
@@ -954,6 +958,21 @@ This section gives the alert rules for the Blackbox_exporter TCP, ICMP, and HTTP
     * Check whether the TiDB process exists.
     * Check whether the network between the monitoring machine and the TiDB machine is normal.
 
+#### `TiFlash_server_is_down`
+
+* Alert rule:
+
+    `probe_success{group="tiflash"} == 0`
+
+* Description:
+    Failure to probe the TiFlash service port.
+
+* Solution:
+
+    * Check whether the machine that provides the TiFlash service is down.
+    * Check whether the TiFlash process exists.
+    * Check whether the network between the monitoring machine and the TiFlash machine is normal.
+
 #### `Pump_server_is_down`
 
 * Alert rule:
diff --git a/tiflash/tiflash-alert-rules.md b/tiflash/tiflash-alert-rules.md
index 4f9a6d1f423fc..e6750760f2f9c 100644
--- a/tiflash/tiflash-alert-rules.md
+++ b/tiflash/tiflash-alert-rules.md
@@ -34,7 +34,7 @@ This document introduces the alert rules of the TiFlash cluster.
 
 - Solution:
 
-    It might be caused by the internal problems of the TiFlash TMT engine. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
+    It might be caused by the internal problems of the TiFlash storage engine. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
 
 ## `TiFlash_raft_read_index_duration`
 

From f9ab09b11e987dff1a5e5f0d89110e56e88099eb Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Fri, 8 Oct 2021 16:57:50 +0800
Subject: [PATCH 2/4] Update alert-rules.md

---
 alert-rules.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/alert-rules.md b/alert-rules.md
index e4b4ac720340e..fcf96c5e6d89d 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -783,7 +783,7 @@ This section gives the alert rules for the TiKV component.
 
 ## TiFlash alert rules
 
-For the detailed descriptions of TiFlash alert rules, see [TiFlash Alert Rules](tiflash\tiflash-alert-rules.md).
+For the detailed descriptions of TiFlash alert rules, see [TiFlash Alert Rules](/tiflash/tiflash-alert-rules.md).
 
 ## TiDB Binlog alert rules
 

From 2225ac708bafa5ed8a9651809647f4c00d6a9389 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Fri, 8 Oct 2021 18:05:14 +0800
Subject: [PATCH 3/4] Update alert-rules.md

---
 alert-rules.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/alert-rules.md b/alert-rules.md
index fcf96c5e6d89d..fc7b39d3591bb 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -783,7 +783,7 @@ This section gives the alert rules for the TiKV component.
 
 ## TiFlash alert rules
 
-For the detailed descriptions of TiFlash alert rules, see [TiFlash Alert Rules](/tiflash/tiflash-alert-rules.md).
+For the detailed descriptions of TiFlash alert rules, see [TiFlash Alert Rules](\tiflash\tiflash-alert-rules.md).
 
 ## TiDB Binlog alert rules
 
@@ -965,6 +965,7 @@ This section gives the alert rules for the Blackbox_exporter TCP, ICMP, and HTTP
     `probe_success{group="tiflash"} == 0`
 
 * Description:
+
     Failure to probe the TiFlash service port.
 
 * Solution:

From 62e6e15d8d9f0b78b87240d8443cd3b4d58c0893 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Fri, 8 Oct 2021 18:49:14 +0800
Subject: [PATCH 4/4] Update alert-rules.md

---
 alert-rules.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/alert-rules.md b/alert-rules.md
index fc7b39d3591bb..b6d482271238e 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -783,7 +783,7 @@ This section gives the alert rules for the TiKV component.
 
 ## TiFlash alert rules
 
-For the detailed descriptions of TiFlash alert rules, see [TiFlash Alert Rules](\tiflash\tiflash-alert-rules.md).
+For the detailed descriptions of TiFlash alert rules, see [TiFlash Alert Rules](/tiflash/tiflash-alert-rules.md).
 
 ## TiDB Binlog alert rules