diff --git a/TOC.md b/TOC.md index 7f0c38aaab576..e071510e489bd 100644 --- a/TOC.md +++ b/TOC.md @@ -317,6 +317,7 @@ - [Scale TiFlash](/reference/tiflash/scale.md) - [Upgrade TiFlash Nodes](/reference/tiflash/upgrade.md) - [Configure TiFlash](/reference/tiflash/configuration.md) + - [TiFlash Alert Rules](/reference/tiflash/alert-rules.md) - [Tune TiFlash Performance](/reference/tiflash/tune-performance.md) - [FAQ](/reference/tiflash/faq.md) + TiDB Binlog diff --git a/reference/tiflash/alert-rules.md b/reference/tiflash/alert-rules.md new file mode 100644 index 0000000000000..e16f16c9bbb63 --- /dev/null +++ b/reference/tiflash/alert-rules.md @@ -0,0 +1,69 @@ +--- +title: TiFlash Alert Rules +summary: Learn the alert rules of the TiFlash cluster. +category: reference +--- + +# TiFlash Alert Rules + +This document introduces the alert rules of the TiFlash cluster. + +## `TiFlash_schema_error` + +- Alert rule: + + `increase(tiflash_schema_apply_count{type="failed"}[15m]) > 0` + +- Description: + + When the schema apply error occurs, an alert is triggered. + +- Solution: + + The error might be caused by some wrong logic. Contact [TiFlash R&D](mailto:support@pingcap.com) for support. + +## `TiFlash_schema_apply_duration` + +- Alert rule: + + `histogram_quantile(0.99, sum(rate(tiflash_schema_apply_duration_seconds_bucket[1m])) BY (le, instance)) > 20` + +- Description: + + When the probability that the apply duration exceeds 20 seconds is over 99%, an alert is triggered. + +- Solution: + + It might be caused by the internal problems of the TiFlash TMT engine. Contact [TiFlash R&D](mailto:support@pingcap.com) for support. + +## `TiFlash_raft_read_index_duration` + +- Alert rule: + + `histogram_quantile(0.99, sum(rate(tiflash_raft_read_index_duration_seconds_bucket[1m])) BY (le, instance)) > 3` + +- Description: + + When the probability that the read index duration exceeds 3 seconds is over 99%, an alert is triggered. + + > **Note:** + > + > `read index` is the kvproto request sent to the TiKV leader. TiKV region retries, busy store, or network problems might lead to long request time of `read index`. + +- Solution: + + The frequent retries might be caused by frequent splitting or migration of the TiKV cluster. You can check the TiKV cluster status to identify the retry reason. + +## `TiFlash_raft_wait_index_duration` + +- Alert rule: + + `histogram_quantile(0.99, sum(rate(tiflash_raft_wait_index_duration_seconds_bucket[1m])) BY (le, instance)) > 2` + +- Description: + + When the probability that the waiting time for Region Raft Index in TiFlash exceeds 2 seconds is over 99%, an alert is triggered. + +- Solution: + + It might be caused by a communication error between TiKV and the proxy. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.