From 1d8510756b5fd735aa55336c27f7d19af468394b Mon Sep 17 00:00:00 2001
From: Lynn
Date: Mon, 29 Jun 2020 14:21:57 +0800
Subject: [PATCH 1/4] *: update grafana-tidb-dashboard

---
 grafana-tidb-dashboard.md | 146 ++++++++++++++++++++++----------------
 1 file changed, 86 insertions(+), 60 deletions(-)

diff --git a/grafana-tidb-dashboard.md b/grafana-tidb-dashboard.md
index 3d5ee453f891a..f26057bc4f61c 100644
--- a/grafana-tidb-dashboard.md
+++ b/grafana-tidb-dashboard.md
@@ -7,9 +7,12 @@ aliases: ['/docs/dev/grafana-tidb-dashboard/','/docs/dev/reference/key-monitorin
 
 # TiDB Monitoring Metrics
 
-If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see [TiDB Monitoring Framework Overview](/tidb-monitoring-framework.md).
+When deploying a TiDB cluster using TiDB Ansible or TiUP, a one-click deployment monitoring system (Prometheus & Grafana) is used. For the monitoring architecture, see [TiDB Monitoring Framework Overview](/tidb-monitoring-framework.md).
 
-The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, Disk Performance, and so on. The TiDB dashboard consists of the TiDB panel and the TiDB Summary panel, and what displays on the TiDB Summary panel also displays on the TiDB panel. A lot of metrics are there to help you diagnose.
+The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, Disk Performance, and so on. The TiDB dashboard consists of the TiDB panel and the TiDB Summary panel. The differences between the two panels are as follows:
+
+- TiDB Panel: provide as comprehensive information as possible for troubleshooting cluster abnormalities.
+- TiDB Summary Panel: extract the parts of the TiDB panel that the user is most concerned about and make some modifications. It is mainly used to provide data (such as QPS, TPS, response delay) that users care about in the daily operation of the database, so as to be used as monitoring information for external display and reporting.
 
 This document describes some key monitoring metrics displayed on the TiDB dashboard.
 
@@ -18,53 +21,67 @@ This document describes some key monitoring metrics displayed on the TiDB dashbo
 To understand the key metrics displayed on the TiDB dashboard, check the following list:
 
 - Query Summary
-    - Duration: the execution time of a SQL statement
-    - QPS: the statistics of OKs and Errors according to the SQL execution result on each TiDB instance
-    - Statement OPS: the statistics of executed SQL statements (including `SELECT`, `INSERT`, `UPDATE` and so on)
-    - QPS By Instance: the QPS on each TiDB instance
-    - Failed Query OPM: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors happening when executing SQL statements on each TiDB instance
+    - Duration: execution time
+        - The time when the client's network request is sent to TiDB and returned to the client after TiDB is executed. In general, client requests are sent in the form of SQL statements, but can also include the execution time of commands such as `COM_PING`, `COM_SLEEP`, `COM_STMT_FETCH`, `COM_SEND_LONG_DATA`
+        - Since TiDB supports Multi-Query, it can accept multiple SQL statements sent by the client at one time, such as `select 1; select 1; select 1;`. At this time, the statistical execution time is the total time after the execution of all SQL statements
+    - QPS: the number of SQL statements executed per second on all TiDB instances. According to the success or failure of the execution result (OK/Error) to distinguish.
+    - Statement OPS: the number of different types of SQL statements executed per second. According to `SELECT`, `INSERT`, `UPDATE` and other types of statistics
+    - QPS By Instance: the QPS on each TiDB instance, which is classified according to the success or failure of command execution results
+    - Failed Query OPM: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors happening when executing SQL statements per second on each TiDB instance. It contains the module to which the error belongs and the error code
     - Slow query: the statistics of the processing time of slow queries (the time cost of the entire slow query, the time cost of Coprocessor,and the waiting time for Coprocessor scheduling)
     - 999/99/95/80 Duration: the statistics of the execution time for different types of SQL statements (different percentiles)

 - Query Detail
     - Duration 80/95/99/999 By Instance: the statistics of the execution time for SQL statements on each TiDB instance (different percentiles)
-    - Failed Query OPM Detail: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors happening when executing SQL statements in the entire cluster
-    - Internal SQL OPS: the statistics of the executed SQL statements within TiDB
+    - Failed Query OPM Detail: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors happening when executing SQL statements on each TiDB instance
+    - Internal SQL OPS: the QPS executed by internal SQL statements in the entire TiDB cluster. Internal SQL statements are automatically executed SQL statements within TiDB. It is generally triggered by user SQL statements or internally scheduled tasks.

 - Server
-    - Uptime: the runtime of TiDB
-    - Memory Usage: the statistics of memory usage of different TiDB instances
-    - CPU Usage: the statistics of CPU usage of different TiDB instances
+    - Uptime: the runtime of each TiDB instance
+    - Memory Usage: the memory usage statistics of each TiDB instance are divided into the memory occupied by the process and the memory applied by Golang on the heap
+    - CPU Usage: the statistics of CPU usage for each TiDB instance
     - Connection Count: the number of clients connected to each TiDB instance
-    - Open FD Count: the statistics of opened file descriptors of different TiDB instances
-    - Goroutine Count: the number of Goroutines of different TiDB instances
-    - Go GC Duration: the statistics of GC time of different TiDB instances
-    - Go Threads: the number of threads of different TiDB instances
-    - Go GC Count: the number of times that GC is executed on different TiDB instances
-    - Go GC CPU Usage: the statistics of GC CPU usage of different TiDB instances
+    - Open FD Count: the statistics of opened file descriptors on each TiDB instance
+    - Goroutine Count: the number of Goroutines on each TiDB instance
+    - Go GC Duration: Golang GC time on each TiDB instance
+    - Go Threads: the number of threads on each TiDB instance
+    - Go GC Count: the number of times that Golang GC is executed on each TiDB instance
+    - Go GC CPU Usage: the statistics of CPU used by Golang GC for each TiDB instance
     - Events OPM: the statistics of key events, such as "start", "close", "graceful-shutdown","kill", "hang", and so on
-    - Keep Alive OPM: the number of times that the metrics are refreshed every minute on different TiDB instances
+    - Keep Alive OPM: the number of times that the metrics are refreshed every minute on each TiDB instance. It usually needs no attention.
     - Prepare Statement Count: the number of `Prepare` statements that are executed on each TiDB instance and the total count of them
-    - Time Jump Back OPS: the number of times that the time rewinds every second on different TiDB instances
-    - Heap Memory Usage: the heap memory size used by each TiDB instance
-    - Uncommon Error OPM: the statistics of abnormal TiDB errors, including panic, binlog write failure, and so on
-    - Handshake Error OPS: the number of times that a handshake error occurs every second on different TiDB instances
-    - Get Token Duration: the time cost of getting Token after establishing the connection
+    - Time Jump Back OPS: the number of times that the time of operating system rewinds every second on each TiDB instance
+    - Write Binlog Error:the number of times that binlog failed to write every second on each TiDB instance
+    - Handshake Error OPS: the number of times that a handshake error occurs every second on each TiDB instance
+    - Get Token Duration: the time cost of getting Token on each connection

 - Transaction
-    - Transaction OPS: the statistics of executed transactions
+    - Transaction OPS: the number of transactions executed per second
     - Duration: the execution time of a transaction
-    - Transaction Retry Num: the number of times that a transaction retries
     - Transaction Statement Num: the number of SQL statements in a transaction
-    - Session Retry Error OPS: the number of errors encountered during the transaction retry
-    - Local Latch Wait Duration: the waiting time of a local transaction
+    - Transaction Retry Num: the number of times that a transaction retries
+    - Session Retry Error OPS: the number of errors encountered during the transaction retry per second
+    - KV Transaction OPS: the number of transactions executed per second within each TiDB
+        - A user's transaction may trigger multiple transaction executions within TiDB, including internal metadata reading, user transaction atomic retry execution, and so on
+        - TiDB's internal scheduled tasks will also operate the database through transactions, this part is also included in this panel
+    - KV Transaction Duration: the time spent on executing transactions within each TiDB
+    - Commit Token Wait Duration: the flow control queue takes time to wait when the transaction is committed. When there is a long wait, it means that the committed transaction is too large and is limiting the current. If the system still has resources available, you can speed up the committing by increasing the `committer-concurrency` in the TiDB configuration file
+    - Transaction Max Write KV Num: the maximum number of key-value pairs written by a single transaction
+    - Transaction Max Write Size Bytes: the maximum key-value pair size written by a single transaction
+    - Transaction Regions Num 90: 90% of the number of Regions written by a single transaction
+    - Send HeartBeat Duration: the interval between transactions sending heartbeats
+    - TTL Lifetime Reach Counter: the TTL of the transaction reached the upper limit. The default value of the upper limit of TTL is 10 minutes. It means that the first lock of a pessimistic transaction or the first prewrite of an optimistic transaction exceeds 10 minutes. The default value of the upper limit of TTL is 10 minutes. The upper limit of TTL life can be changed by modifying `max-txn-TTL` in the TiDB configuration file
+    - Statement Lock Keys: the number of locks for a single statement
+    - Acquire Pessimistic Locks Duration: the time consumed by locking
+    - Pessimistic Statement Retry OPS: the number of retry attempts for pessimistic statements. When the statement tries to lock, it may encounter a write conflict error. At this time, the statement will reacquire a new snapshot and lock again
+    - Load Safepoint OPS: the number of times that `Safepoint` loads. The function of `Safepoint` is to ensure that the data before `Safepoint` is not read when the transaction reads data, so as to ensure the safety of the data. Because the data before `Safepoint` may be cleaned up by the GC

 - Executor
     - Parse Duration: the statistics of the parsing time of SQL statements
-    - Compile Duration: the statistics of the time of compiling an SQL AST to the execution plan
+    - Compile Duration: the statistics of the time of compiling the parsed SQL AST to the execution plan
     - Execution Duration: the statistics of the execution time for SQL statements
-    - Expensive Executor OPS: the statistics of the operators that consume many system resources, including `Merge Join`, `Hash Join`, `Index Look Up Join`, `Hash Agg`, `Stream Agg`, `Sort`, `TopN`, and so on
-    - Queries Using Plan Cache OPS: the statistics of queries using the Plan Cache
+    - Expensive Executor OPS: the statistics of the operators that consume many system resources per second, including `Merge Join`, `Hash Join`, `Index Look Up Join`, `Hash Agg`, `Stream Agg`, `Sort`, `TopN`, and so on
+    - Queries Using Plan Cache OPS: the statistics of queries using the Plan Cache per second

 - Distsql
     - Distsql Duration: the processing time of Distsql statements
@@ -75,69 +92,78 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi
     - Partial Num: the number of Partial results for each SQL statement

 - KV Errors
-    - KV Retry Duration: the time that a KV retry request lasts
+    - KV Backoff Duration: the time that a KV retry request lasts
     - TiClient Region Error OPS: the number of Region related error messages returned by TiKV
-    - KV Backoff OPS: the number of error messages (transaction conflicts and so on) returned by TiKV
-    - Lock Resolve OPS: the number of errors related to transaction conflicts
-    - Other Errors OPS: the number of other types of errors, including clearing locks and updating SafePoint
+    - KV Backoff OPS: the number of error messages returned by TiKV
+    - Lock Resolve OPS: the number of TiDB cleanup lock operations. When TiDB's read-write request encounters a lock, it will try to clean up the lock
+    - Other Errors OPS: the number of other types of errors, including clearing locks and updating `SafePoint`

 - KV Duration
-    - KV Request Duration 999 by store: the execution time of a KV request, displayed according to TiKV
-    - KV Request Duration 999 by type: the execution time of a KV request, displayed according to the request type
-    - KV Cmd Duration 99/999: the execution time of KV commands
-
-- KV Count
-    - KV Cmd OPS: the statistics of executed KV commands
-    - KV Txn OPS: the statistics of started transactions
-    - Txn Regions Num 90: the statistics of Regions used by the transaction
-    - Txn Write Size Bytes 100: the statistics of bytes written by the transaction
-    - Txn Write KV Num 100: the statistics of KVs written by the transaction
-    - Load SafePoint OPS: the statistics of operations that update SafePoint
+    - KV Request OPS: the execution times of a KV request, displayed according to TiKV
+    - KV Request Duration 99 by store: the execution time of a KV request, displayed according to TiKV
+    - KV Request Duration 99 by type: the execution time of a KV request, displayed according to the request type

 - PD Client
-    - PD Client CMD OPS: the statistics of commands executed by PD Client
+    - PD Client CMD OPS: the statistics of commands executed by PD Client per second
     - PD Client CMD Duration: the time it takes PD Client to execute commands
-    - PD Client CMD Fail OPS: the statistics of failed commands executed by PD Client
-    - PD TSO OPS: the number of TSO that TiDB obtains from PD
-    - PD TSO Wait Duration: the time it takes TiDB to obtain TSO from PD
-    - PD TSO RPC Duration: the time it takes TiDB to obtain TSO gRPC interface from PD
+    - PD Client CMD Fail OPS: the statistics of failed commands executed by PD Client per second
+    - PD TSO OPS: the number of TSO that TiDB obtains from PD per second
+    - PD TSO Wait Duration: the time that TiDB waits to return to TSO from PD
+    - PD TSO RPC duration: the time taken by TiDB from sending a request for TSO to PD to receive TSO
+    - Start TSO Wait Duration: the time from TiDB sending PD to get start TSO request to waiting for TSO to return

 - Schema Load
     - Load Schema Duration: the time it takes TiDB to obtain the schema from TiKV
-    - Load Schema OPS: the statistics of the schemas that TiDB obtains from TiKV
-    - Schema Lease Error OPM: the Schema Lease error, including two types named "change" and "outdate"; an alarm is triggered when an "outdate" error occurs
+    - Load Schema OPS: the statistics of the schemas that TiDB obtains from TiKV per second
+    - Schema Lease Error OPM: the Schema Lease errors include two types named `change` and `outdate`. `change` means that the schema has changed, and `outdate` means that the schema cannot be updated, which is a more serious error. It will alarm when an `outdate` error occurs
+    - Load Privilege OPS: the statistics of the number of privilege information obtained by TiDB from TiKV per second

 - DDL
-    - DDL Duration 95: the statistics of DDL statements processing time
-    - Batch Add Index Duration 100: the statistics of the time that it takes each Batch to create the index
+    - DDL Duration 95: 95% quantile of DDL statement processing time
+    - Batch Add Index Duration 100: statistics of the maximum time spent by each Batch when creating an index
     - DDL Waiting Jobs Count: the number of DDL tasks that are waiting
     - DDL META OPM: the number of times that a DDL obtains META every minute
+    - DDL Worker Duration 99: 99% of the execution time of each DDL worker
     - Deploy Syncer Duration: the time consumed by Schema Version Syncer initialization, restart, and clearing up operations
     - Owner Handle Syncer Duration: the time that it takes the DDL Owner to update, obtain, and check the Schema Version
     - Update Self Version Duration: the time consumed by updating the version information of Schema Version Syncer
+    - DDL OPM: the number of executions per second of DDL statements
+    - DDL Add Index Progress In Percentage: the progress of adding an index

 - Statistics
     - Auto Analyze Duration 95: the time consumed by automatic `ANALYZE`
     - Auto Analyze QPS: the statistics of automatic `ANALYZE`
     - Stats Inaccuracy Rate: the information of the statistics inaccuracy rate
     - Pseudo Estimation OPS: the number of the SQL statements optimized using pseudo statistics
-    - Dump Feedback OPS: the number of stored statistical Feedbacks
-    - Update Stats OPS: the statistics of using Feedback to update the statistics information
-    - Significant Feedback: the number of significant Feedback pieces that update the statistics information
+    - Dump Feedback OPS: the number of stored statistical feedbacks
+    - Store Query Feedback QPS: the number of operations per second to store the feedback information of the union query, which is performed in TiDB memory
+    - Significant Feedback: the number of significant feedback pieces that update the statistics information
+    - Update Stats OPS: the number of updating statistics with feedback
+    - Fast Analyze Status 100: the status for quickly collecting statistical information
+
+- Owner
+    - New ETCD Session Duration 95: the time it takes to create a new etcd session. TiDB connects to etcd in PD through etcd client to save/read some metadata information. This records the time spent creating the session
+    - Owner Watcher OPS: the number of operations per second of DDL owner watches PD's etcd metadata

 - Meta
     - AutoID QPS: AutoID related statistics, including three operations (global ID allocation, a single table AutoID allocation, a single table AutoID Rebase)
     - AutoID Duration: the time consumed by AutoID related operations
+    - Region Cache Error OPS: the number of errors encountered per second by TiDB cached region information
     - Meta Operations Duration 99: the latency of Meta operations

 - GC
-    - Worker Action OPM: the statistics of GC related operations, including `run_job`, `resolve_lock`, and `delete\_range`
+    - Worker Action OPM: the number of GC related operations, including `run_job`, `resolve_lock`, and `delete\_range`
     - Duration 99: the time consumed by GC related operations
+    - Config: the configuration of GC data life time and GC run interval
     - GC Failure OPM: the number of failed GC related operations
-    - Action Result OPM: the number of results of GC-related operations
+    - Delete Range Failure OPM: the number of times the `Delete Range` failed
     - Too Many Locks Error OPM: the number of the error that GC clears up too many locks
+    - Action Result OPM: the number of results of GC-related operations
+    - Delete Range Task Status: the task status of `Delete Range`, including completion and failure status
+    - Push Task Duration 95: the time spent pushing GC subtasks to GC workers

 - Batch Client
     - Pending Request Count by TiKV: the number of Batch messages that are pending processing
     - Wait Duration 95: the waiting time of Batch messages that are pending processing
     - Batch Client Unavailable Duration 95: the unavailable time of the Batch client
+    - No Available Connection Counter: the number of times the Batch client cannot find an available link

From 81d61059a1ae19061d51c616917f643afd2bd272 Mon Sep 17 00:00:00 2001
From: Lynn
Date: Mon, 13 Jul 2020 17:05:26 +0800
Subject: [PATCH 2/4] *: address comments

---
 grafana-tidb-dashboard.md | 80 +++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 40 deletions(-)

diff --git a/grafana-tidb-dashboard.md b/grafana-tidb-dashboard.md
index f26057bc4f61c..945cc7c000b13 100644
--- a/grafana-tidb-dashboard.md
+++ b/grafana-tidb-dashboard.md
@@ -7,12 +7,12 @@ aliases: ['/docs/dev/grafana-tidb-dashboard/','/docs/dev/reference/key-monitorin

 # TiDB Monitoring Metrics

-When deploying a TiDB cluster using TiDB Ansible or TiUP, a one-click deployment monitoring system (Prometheus & Grafana) is used. For the monitoring architecture, see [TiDB Monitoring Framework Overview](/tidb-monitoring-framework.md).
+If you use TiDB Ansible or TiUP to deploy the TiDB cluster, the monitoring system (Prometheus & Grafana) is deployed at the same time. For the monitoring architecture, see [TiDB Monitoring Framework Overview](/tidb-monitoring-framework.md).

-The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, Disk Performance, and so on. The TiDB dashboard consists of the TiDB panel and the TiDB Summary panel. The differences between the two panels are as follows:
+The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, Disk Performance, and so on. The TiDB dashboard consists of the TiDB panel and the TiDB Summary panel. The differences between the two panels are different in the following aspects:

-- TiDB Panel: provide as comprehensive information as possible for troubleshooting cluster abnormalities.
-- TiDB Summary Panel: extract the parts of the TiDB panel that the user is most concerned about and make some modifications. It is mainly used to provide data (such as QPS, TPS, response delay) that users care about in the daily operation of the database, so as to be used as monitoring information for external display and reporting.
+- TiDB panel: provides as comprehensive information as possible for troubleshooting cluster anomalies.
+- TiDB Summary Panel: extracts parts of the TiDB panel information with which users are most concerned, with some modifications. It provides data (such as QPS, TPS, response delay) that users care about in the daily database operations, which serve as the monitoring information to be displayed or reported.

 This document describes some key monitoring metrics displayed on the TiDB dashboard.

@@ -22,26 +22,26 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi

 - Query Summary
     - Duration: execution time
-        - The time when the client's network request is sent to TiDB and returned to the client after TiDB is executed. In general, client requests are sent in the form of SQL statements, but can also include the execution time of commands such as `COM_PING`, `COM_SLEEP`, `COM_STMT_FETCH`, `COM_SEND_LONG_DATA`
-        - Since TiDB supports Multi-Query, it can accept multiple SQL statements sent by the client at one time, such as `select 1; select 1; select 1;`. At this time, the statistical execution time is the total time after the execution of all SQL statements
-    - QPS: the number of SQL statements executed per second on all TiDB instances. According to the success or failure of the execution result (OK/Error) to distinguish.
-    - Statement OPS: the number of different types of SQL statements executed per second. According to `SELECT`, `INSERT`, `UPDATE` and other types of statistics
+        - The duration between the time that the client's network request is sent to TiDB and the time that the request is returned to the client after TiDB has executed it. In general, client requests are sent in the form of SQL statements, but can also include the execution time of commands such as `COM_PING`, `COM_SLEEP`, `COM_STMT_FETCH`, and `COM_SEND_LONG_DATA`
+        - Because TiDB supports Multi-Query, it supports sending multiple SQL statements at one time, such as `select 1; select 1; select 1;`. In this case, the total execution time of this query includes the execution time of all SQL statements
+    - QPS: the number of SQL statements executed per second on all TiDB instances. The execution results are classified into `OK` (successful) and `Error` (failed)
+    - Statement OPS: the number of different types of SQL statements executed per second, which is counted according to `SELECT`, `INSERT`, `UPDATE`, and other types of statements
     - QPS By Instance: the QPS on each TiDB instance, which is classified according to the success or failure of command execution results
-    - Failed Query OPM: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors happening when executing SQL statements per second on each TiDB instance. It contains the module to which the error belongs and the error code
+    - Failed Query OPM: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors occurred when executing SQL statements per second on each TiDB instance. It contains the module in which the error occurs and the error code
     - Slow query: the statistics of the processing time of slow queries (the time cost of the entire slow query, the time cost of Coprocessor,and the waiting time for Coprocessor scheduling)
     - 999/99/95/80 Duration: the statistics of the execution time for different types of SQL statements (different percentiles)

 - Query Detail
     - Duration 80/95/99/999 By Instance: the statistics of the execution time for SQL statements on each TiDB instance (different percentiles)
-    - Failed Query OPM Detail: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors happening when executing SQL statements on each TiDB instance
-    - Internal SQL OPS: the QPS executed by internal SQL statements in the entire TiDB cluster. Internal SQL statements are automatically executed SQL statements within TiDB. It is generally triggered by user SQL statements or internally scheduled tasks.
+    - Failed Query OPM Detail: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors occurred when executing SQL statements on each TiDB instance
+    - Internal SQL OPS: the internal SQL statements executed per second in the entire TiDB cluster. The internal SQL statements are automatically executed in. There are generally triggered by user SQL statements or internally scheduled tasks.

 - Server
     - Uptime: the runtime of each TiDB instance
-    - Memory Usage: the memory usage statistics of each TiDB instance are divided into the memory occupied by the process and the memory applied by Golang on the heap
-    - CPU Usage: the statistics of CPU usage for each TiDB instance
+    - Memory Usage: the memory usage statistics of each TiDB instance, which is are divided into the memory occupied by processes and the memory applied by Golang on the heap
+    - CPU Usage: the statistics of CPU usage of each TiDB instance
     - Connection Count: the number of clients connected to each TiDB instance
-    - Open FD Count: the statistics of opened file descriptors on each TiDB instance
+    - Open FD Count: the statistics of opened file descriptors of each TiDB instance
     - Goroutine Count: the number of Goroutines on each TiDB instance
     - Go GC Duration: Golang GC time on each TiDB instance
     - Go Threads: the number of threads on each TiDB instance
@@ -50,8 +50,8 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi
     - Events OPM: the statistics of key events, such as "start", "close", "graceful-shutdown","kill", "hang", and so on
     - Keep Alive OPM: the number of times that the metrics are refreshed every minute on each TiDB instance. It usually needs no attention.
- Prepare Statement Count: the number of `Prepare` statements that are executed on each TiDB instance and the total count of them - - Time Jump Back OPS: the number of times that the time of operating system rewinds every second on each TiDB instance - - Write Binlog Error:the number of times that binlog failed to write every second on each TiDB instance + - Time Jump Back OPS: the number of times that the operating system rewinds every second on each TiDB instance + - Write Binlog Error:the number of times that the binlog write failure occurs every second on each TiDB instance - Handshake Error OPS: the number of times that a handshake error occurs every second on each TiDB instance - Get Token Duration: the time cost of getting Token on each connection @@ -61,20 +61,20 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - Transaction Statement Num: the number of SQL statements in a transaction - Transaction Retry Num: the number of times that a transaction retries - Session Retry Error OPS: the number of errors encountered during the transaction retry per second - - KV Transaction OPS: the number of transactions executed per second within each TiDB - - A user's transaction may trigger multiple transaction executions within TiDB, including internal metadata reading, user transaction atomic retry execution, and so on - - TiDB's internal scheduled tasks will also operate the database through transactions, this part is also included in this panel + - KV Transaction OPS: the number of transactions executed per second within each TiDB instance + - A user transaction might trigger multiple transaction executions in TiDB, including reading internal metadata, atomic retries of the user transaction, and so on + - TiDB's internal scheduled tasks also operates on the database through transactions, which is also included in this panel - KV Transaction Duration: the time spent on executing transactions within each TiDB - - Commit Token Wait 
Duration: the flow control queue takes time to wait when the transaction is committed. When there is a long wait, it means that the committed transaction is too large and is limiting the current. If the system still has resources available, you can speed up the committing by increasing the `committer-concurrency` in the TiDB configuration file + - Commit Token Wait Duration: the wait duration in the flow control queue during the transaction commit. If the wait duration is long, it means that the transaction to commit is too large and the flow is controlled. If the system still has resources available, you can speed up the committing by increasing the `committer-concurrency` value in the TiDB configuration file - Transaction Max Write KV Num: the maximum number of key-value pairs written by a single transaction - Transaction Max Write Size Bytes: the maximum key-value pair size written by a single transaction - Transaction Regions Num 90: 90% of the number of Regions written by a single transaction - - Send HeartBeat Duration: the interval between transactions sending heartbeats - - TTL Lifetime Reach Counter: the TTL of the transaction reached the upper limit. The default value of the upper limit of TTL is 10 minutes. It means that the first lock of a pessimistic transaction or the first prewrite of an optimistic transaction exceeds 10 minutes. The default value of the upper limit of TTL is 10 minutes. The upper limit of TTL life can be changed by modifying `max-txn-TTL` in the TiDB configuration file + - Send HeartBeat Duration: the interval at which transactions send heartbeats + - TTL Lifetime Reach Counter: the number of transactions that reach the upper limit of TTL. The default value of the TTL upper limit is 10 minutes. It means that 10 minutes have passed since the first lock of a pessimistic transaction or the first prewrite of an optimistic transaction. The default value of the upper limit of TTL is 10 minutes. 
The TTL upper limit can be changed by modifying `max-txn-ttl` in the TiDB configuration file - Statement Lock Keys: the number of locks for a single statement - - Acquire Pessimistic Locks Duration: the time consumed by locking - - Pessimistic Statement Retry OPS: the number of retry attempts for pessimistic statements. When the statement tries to lock, it may encounter a write conflict error. At this time, the statement will reacquire a new snapshot and lock again - - Load Safepoint OPS: the number of times that `Safepoint` loads. The function of `Safepoint` is to ensure that the data before `Safepoint` is not read when the transaction reads data, so as to ensure the safety of the data. Because the data before `Safepoint` may be cleaned up by the GC + - Acquire Pessimistic Locks Duration: the time consumed by adding locks + - Pessimistic Statement Retry OPS: the number of retry attempts for pessimistic statements. When the statement tries to add a lock, it might encounter a write conflict. In this case, the statement acquires a new snapshot and adds the lock again + - Load Safepoint OPS: the number of times that `Safepoint` is loaded. `Safepoint` ensures that a transaction does not read data before `Safepoint`, because that data might be cleaned up by the GC - Executor - Parse Duration: the statistics of the parsing time of SQL statements @@ -92,42 +92,42 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - Partial Num: the number of Partial results for each SQL statement - KV Errors - - KV Backoff Duration: the time that a KV retry request lasts + - KV Backoff Duration: the total time that a KV retry request lasts. TiDB may encounter an error when sending a request to TiKV. TiDB has a retry mechanism for every request to TiKV.
The total time of a request retry is recorded here - TiClient Region Error OPS: the number of Region related error messages returned by TiKV - KV Backoff OPS: the number of error messages returned by TiKV - - Lock Resolve OPS: the number of TiDB cleanup lock operations. When TiDB's read-write request encounters a lock, it will try to clean up the lock + - Lock Resolve OPS: the number of TiDB operations to resolve locks. When TiDB's read or write request encounters a lock, it tries to resolve the lock - Other Errors OPS: the number of other types of errors, including clearing locks and updating `SafePoint` -- KV Duration +- KV Request - KV Request OPS: the execution times of a KV request, displayed according to TiKV - KV Request Duration 99 by store: the execution time of a KV request, displayed according to TiKV - KV Request Duration 99 by type: the execution time of a KV request, displayed according to the request type - PD Client - PD Client CMD OPS: the statistics of commands executed by PD Client per second - - PD Client CMD Duration: the time it takes PD Client to execute commands + - PD Client CMD Duration: the time it takes for PD Client to execute commands - PD Client CMD Fail OPS: the statistics of failed commands executed by PD Client per second - PD TSO OPS: the number of TSO that TiDB obtains from PD per second - - PD TSO Wait Duration: the time that TiDB waits to return to TSO from PD - - PD TSO RPC duration: the time taken by TiDB from sending a request for TSO to PD to receive TSO - - Start TSO Wait Duration: the time from TiDB sending PD to get start TSO request to waiting for TSO to return + - PD TSO Wait Duration: the time that TiDB waits for PD to return TSO + - PD TSO RPC duration: the duration from the time that TiDB sends a request to PD (to get TSO) to the time that TiDB receives TSO + - Start TSO Wait Duration: the duration from the time that TiDB sends a request to PD (to get `start TSO`) to the time that TiDB receives `start TSO` - Schema Load
- Load Schema Duration: the time it takes TiDB to obtain the schema from TiKV - Load Schema OPS: the statistics of the schemas that TiDB obtains from TiKV per second - - Schema Lease Error OPM: the Schema Lease errors include two types named `change` and `outdate`. `change` means that the schema has changed, and `outdate` means that the schema cannot be updated, which is a more serious error. It will alarm when an `outdate` error occurs + - Schema Lease Error OPM: the Schema Lease errors include two types: `change` and `outdate`. `change` means that the schema has changed, and `outdate` means that the schema cannot be updated, which is a more serious error and triggers an alert. - Load Privilege OPS: the statistics of the number of privilege information obtained by TiDB from TiKV per second - DDL - DDL Duration 95: 95% quantile of DDL statement processing time - - Batch Add Index Duration 100: statistics of the maximum time spent by each Batch when creating an index + - Batch Add Index Duration 100: statistics of the maximum time spent by each Batch on creating an index - DDL Waiting Jobs Count: the number of DDL tasks that are waiting - DDL META OPM: the number of times that a DDL obtains META every minute - - DDL Worker Duration 99: 99% of the execution time of each DDL worker + - DDL Worker Duration 99: 99% quantile of the execution time of each DDL worker - Deploy Syncer Duration: the time consumed by Schema Version Syncer initialization, restart, and clearing up operations - Owner Handle Syncer Duration: the time that it takes the DDL Owner to update, obtain, and check the Schema Version - Update Self Version Duration: the time consumed by updating the version information of Schema Version Syncer - - DDL OPM: the number of executions per second of DDL statements + - DDL OPM: the number of DDL executions per second - DDL Add Index Progress In Percentage: the progress of adding an index - Statistics @@ -138,7 +138,7 @@ To understand the key metrics displayed on 
the TiDB dashboard, check the followi - Dump Feedback OPS: the number of stored statistical feedbacks - Store Query Feedback QPS: the number of operations per second to store the feedback information of the union query, which is performed in TiDB memory - Significant Feedback: the number of significant feedback pieces that update the statistics information - - Update Stats OPS: the number of updating statistics with feedback + - Update Stats OPS: the number of operations of updating statistics with feedback - Fast Analyze Status 100: the status for quickly collecting statistical information - Owner @@ -148,7 +148,7 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - Meta - AutoID QPS: AutoID related statistics, including three operations (global ID allocation, a single table AutoID allocation, a single table AutoID Rebase) - AutoID Duration: the time consumed by AutoID related operations - - Region Cache Error OPS: the number of errors encountered per second by TiDB cached region information + - Region Cache Error OPS: the number of errors encountered per second by the cached Region information in TiDB - Meta Operations Duration 99: the latency of Meta operations - GC @@ -156,10 +156,10 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - Duration 99: the time consumed by GC related operations - Config: the configuration of GC data life time and GC run interval - GC Failure OPM: the number of failed GC related operations - - Delete Range Failure OPM: the number of times the `Delete Range` failed + - Delete Range Failure OPM: the number of times the `Delete Range` has failed - Too Many Locks Error OPM: the number of the error that GC clears up too many locks - Action Result OPM: the number of results of GC-related operations - - Delete Range Task Status: the task status of `Delete Range`, including completion and failure status + - Delete Range Task Status: the task status of `Delete Range`, 
including completion and failure - Push Task Duration 95: the time spent pushing GC subtasks to GC workers - Batch Client From 06f08347832c5c82a00c33941117daf806309c90 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Mon, 13 Jul 2020 20:26:25 +0800 Subject: [PATCH 3/4] Apply suggestions from code review --- grafana-tidb-dashboard.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/grafana-tidb-dashboard.md b/grafana-tidb-dashboard.md index 945cc7c000b13..59a8775df757a 100644 --- a/grafana-tidb-dashboard.md +++ b/grafana-tidb-dashboard.md @@ -12,7 +12,7 @@ If you use TiDB Ansible or TiUP to deploy the TiDB cluster, the monitoring syste The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, Disk Performance, and so on. The TiDB dashboard consists of the TiDB panel and the TiDB Summary panel. The differences between the two panels are different in the following aspects: - TiDB panel: provides as comprehensive information as possible for troubleshooting cluster anomalies. -- TiDB Summary Panel: extracts parts of the TiDB panel information with which users are most concerned, with some modifications. It provides data (such as QPS, TPS, response delay) that users care about in the daily database operations, which serve as the monitoring information to be displayed or reported. +- TiDB Summary Panel: extracts parts of the TiDB panel information with which users are most concerned, with some modifications. It provides data (such as QPS, TPS, response delay) that users care about in the daily database operations, which serves as the monitoring information to be displayed or reported. This document describes some key monitoring metrics displayed on the TiDB dashboard. 
@@ -34,11 +34,11 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - Query Detail - Duration 80/95/99/999 By Instance: the statistics of the execution time for SQL statements on each TiDB instance (different percentiles) - Failed Query OPM Detail: the statistics of error types (such as syntax errors and primary key conflicts) according to the errors occurred when executing SQL statements on each TiDB instance - - Internal SQL OPS: the internal SQL statements executed per second in the entire TiDB cluster. The internal SQL statements are automatically executed in. There are generally triggered by user SQL statements or internally scheduled tasks. + - Internal SQL OPS: the internal SQL statements executed per second in the entire TiDB cluster. The internal SQL statements are internally executed and are generally triggered by user SQL statements or internally scheduled tasks. - Server - Uptime: the runtime of each TiDB instance - - Memory Usage: the memory usage statistics of each TiDB instance, which is are divided into the memory occupied by processes and the memory applied by Golang on the heap + - Memory Usage: the memory usage statistics of each TiDB instance, which is divided into the memory occupied by processes and the memory applied by Golang on the heap - CPU Usage: the statistics of CPU usage of each TiDB instance - Connection Count: the number of clients connected to each TiDB instance - Open FD Count: the statistics of opened file descriptors of each TiDB instance @@ -63,9 +63,9 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - Session Retry Error OPS: the number of errors encountered during the transaction retry per second - KV Transaction OPS: the number of transactions executed per second within each TiDB instance - A user transaction might trigger multiple transaction executions in TiDB, including reading internal metadata, atomic retries of the user transaction, and so on - - 
TiDB's internal scheduled tasks also operates on the database through transactions, which is also included in this panel + - TiDB's internally scheduled tasks also operate on the database through transactions, which are also included in this panel - KV Transaction Duration: the time spent on executing transactions within each TiDB - - Commit Token Wait Duration: the wait duration in the flow control queue during the transaction commit. If the wait duration is long, it means that the transaction to commit is too large and the flow is controlled. If the system still has resources available, you can speed up the committing by increasing the `committer-concurrency` value in the TiDB configuration file + - Commit Token Wait Duration: the wait duration in the flow control queue during the transaction commit. If the wait duration is long, it means that the transaction to commit is too large and the flow is controlled. If the system still has resources available, you can speed up the commit process by increasing the `committer-concurrency` value in the TiDB configuration file - Transaction Max Write KV Num: the maximum number of key-value pairs written by a single transaction - Transaction Max Write Size Bytes: the maximum key-value pair size written by a single transaction - Transaction Regions Num 90: 90% of the number of Regions written by a single transaction @@ -92,7 +92,7 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - Partial Num: the number of Partial results for each SQL statement - KV Errors - - KV Backoff Duration: the total time that a KV retry request lasts. TiDB may encounter an error when sending a request to TiKV. TiDB has a retry mechanism for every request to TiKV. The total time of a request retry is recorded here + - KV Backoff Duration: the total duration that a KV retry request lasts. TiDB might encounter an error when sending a request to TiKV. TiDB has a retry mechanism for every request to TiKV. 
This `KV Backoff Duration` item records the total time of a request retry. - TiClient Region Error OPS: the number of Region related error messages returned by TiKV - KV Backoff OPS: the number of error messages returned by TiKV - Lock Resolve OPS: the number of TiDB operations to resolve locks. When TiDB's read or write request encounters a lock, it tries to resolve the lock @@ -154,7 +154,7 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - GC - Worker Action OPM: the number of GC related operations, including `run_job`, `resolve_lock`, and `delete\_range` - Duration 99: the time consumed by GC related operations - - Config: the configuration of GC data life time and GC run interval + - Config: the configuration of GC data life time and GC running interval - GC Failure OPM: the number of failed GC related operations - Delete Range Failure OPM: the number of times the `Delete Range` has failed - Too Many Locks Error OPM: the number of the error that GC clears up too many locks From dbd663b79956e42e802971cf5905474ba92328d9 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 14 Jul 2020 13:45:37 +0800 Subject: [PATCH 4/4] Update grafana-tidb-dashboard.md --- grafana-tidb-dashboard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/grafana-tidb-dashboard.md b/grafana-tidb-dashboard.md index 59a8775df757a..309fb410de64c 100644 --- a/grafana-tidb-dashboard.md +++ b/grafana-tidb-dashboard.md @@ -143,7 +143,7 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi - Owner - New ETCD Session Duration 95: the time it takes to create a new etcd session. TiDB connects to etcd in PD through etcd client to save/read some metadata information. 
This records the time spent creating the session - - Owner Watcher OPS: the number of operations per second of DDL owner watches PD's etcd metadata + - Owner Watcher OPS: the number of Goroutine operations per second in which the DDL owner watches PD's etcd metadata - Meta - AutoID QPS: AutoID related statistics, including three operations (global ID allocation, a single table AutoID allocation, a single table AutoID Rebase)