diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index cbd19d9eee469..944aac64dd792 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -6,234 +6,415 @@ aliases: ['/docs/stable/grafana-tikv-dashboard/','/docs/v4.0/grafana-tikv-dashbo
 
 # Key Monitoring Metrics of TiKV
 
-If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see [Overview of the Monitoring Framework](/tidb-monitoring-framework.md).
-
-The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, Disk Performance, and so on. A lot of metrics are there to help you diagnose.
-
-You can get an overview of the component TiKV status from the TiKV dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.
-
-## Key metrics description
-
-To understand the key metrics displayed on the Overview dashboard, check the following table:
-
-Service | Panel name | Description | Normal range
----------------- | ---------------- | ---------------------------------- | --------------
-Cluster | Store size | The storage size per TiKV instance |
-Cluster | Available size | The available capacity per TiKV instance |
-Cluster | Capacity size | The capacity size per TiKV instance |
-Cluster | CPU | The CPU usage per TiKV instance |
-Cluster | Memory | The memory usage per TiKV instance |
-Cluster | IO utilization | The I/O utilization per TiKV instance |
-Cluster | MBps | The total bytes of read and write in each TiKV instance |
-Cluster | QPS | The QPS per command in each TiKV instance |
-Cluster | Errors-gRPC | The total number of gRPC message failures |
-Cluster | Leaders | The number of leaders per TiKV instance |
-Cluster | Regions | The number of Regions per TiKV instance |
-Errors | Server is busy | Indicates occurrences of events that make the TiKV instance unavailable temporarily, such as Write Stall, Channel Full, Scheduler Busy, and Coprocessor Full|
-Errors | Server message failures | The number of failed messages between TiKV instances | It should be `0` in normal case.
-Errors | Raftstore errors | The number of Raftstore errors per type on each TiKV instance |
-Errors | Scheduler errors | The number of scheduler errors per type on each TiKV instance |
-Errors | Coprocessor errors | The number of coprocessor errors per type on each TiKV instance |
-Errors | gRPC message errors | The number of gRPC message errors per type on each TiKV instance |
-Errors | Leader drop | The count of dropped leaders per TiKV instance |
-Errors | Leader missing | The count of missing leaders per TiKV instance |
-Server | Leaders | The number of leaders per TiKV instance |
-Server | Regions | The number of Regions per TiKV instance |
-Server | CF size | The size of each column family |
-Server | Store size | The storage size per TiKV instance |
-Server | Channel full | The number of Channel Full errors per TiKV instance | It should be `0` in normal case.
-Server | Server message failures | The number of failed messages between TiKV instances |
-Server | Average Region written keys | The average rate of written keys to Regions per TiKV instance |
-Server | Average Region written bytes | The average rate of writing bytes to Regions per TiKV instance |
-Server | Active written leaders | The number of leaders being written on each TiKV instance |
-Server | Approximate Region size | The approximate Region size |
-Raft IO | Apply log duration | The time consumed for Raft to apply logs |
-Raft IO | Apply log duration per server | The time consumed for Raft to apply logs per TiKV instance |
-Raft IO | Append log duration | The time consumed for Raft to append logs |
-Raft IO | Append log duration per server | The time consumed for Raft to append logs per TiKV instance |
-Raft process | Ready handled | The count of handled ready buckets per region |
-Raft process | Process ready duration per server | The time consumed for peer processes to be ready in Raft | It should be less than `2s` (P99.99).
-Raft process | Process tick duration per server | The peer processes in Raft |
-Raft process | 99% Duration of raftstore events | The time consumed by raftstore events (P99) |
-Raft message | Sent messages per server | The number of Raft messages sent by each TiKV instance |
-Raft message | Flush messages per server | The number of Raft messages flushed by each TiKV instance |
-Raft message | Receive messages per server | The number of Raft messages received by each TiKV instance |
-Raft message | Messages | The number of Raft messages sent per type |
-Raft message | Vote | The number of Vote messages sent in Raft |
-Raft message | Raft dropped messages | The number of dropped Raft messages per type|
-Raft proposal | Raft proposals per ready | The number of Raft proposals of all Regions per ready handled bucket|
-Raft proposal | Raft read/write proposals | The number of proposals per type|
-Raft proposal | Raft read proposals per server | The number of read proposals made by each TiKV instance |
-Raft proposal | Raft write proposals per server | The number of write proposals made by each TiKV instance |
-Raft proposal | Proposal wait duration | The wait time of each proposal |
-Raft proposal | Proposal wait duration per server | The wait time of each proposal per TiKV instance |
-Raft proposal | Raft log speed | The rate at which peers propose logs |
-Raft admin | Admin proposals | The number of admin proposals |
-Raft admin | Admin apply | The number of processed apply commands |
-Raft admin | Check split | The number of raftstore split checks |
-Raft admin | 99.99% Check split duration | The time consumed when running split checks (P99.99) |
-Local reader | Local reader requests | The number of total requests and the number of rejections from the local read thread |
-Local reader | Local read requests duration | The wait time of local read requests |
-Local reader | Local read requests batch size | The batch size of local read requests |
-Storage | Storage command total | The total number of received commands per type |
-Storage | Storage async request error | The total number of engine asynchronous request errors |
-Storage | Storage async snapshot duration | The time consumed by processing asynchronous snapshot requests | It should be less than `1s` in `.99`.
-Storage | Storage async write duration | The time consumed by processing asynchronous write requests | It should be less than `1s` in `.99`.
-Scheduler | Scheduler stage total | The total number of commands at each stage | There should not be lots of errors in a short time.
-Scheduler | Scheduler priority commands | The count of different priority commands |
-Scheduler | Scheduler pending commands | The count of pending commands per TiKV instance |
-Scheduler - XX | Scheduler stage total | The total number of commands at each stage when executing the batch_get command | There should not be lots of errors in a short time.
-Scheduler - XX | Scheduler command duration | The time consumed when executing the batch_get command | It should be less than `1s`.
-Scheduler - XX | Scheduler latch wait duration | The wait time caused by latch when executing the batch_get command | It should be less than `1s`.
-Scheduler - XX | Scheduler keys read | The count of keys read by a batch_get command |
-Scheduler - XX | Scheduler keys written | The count of keys written by a batch_get command |
-Scheduler - XX | Scheduler scan details | The keys scan details of each CF when executing the batch_get command |
-Scheduler - XX | Scheduler scan details [lock] | The keys scan details of lock CF when executing the batch_get command |
-Scheduler - XX | Scheduler scan details [write] | The keys scan details of write CF when executing the batch_get command |
-Scheduler - XX | Scheduler scan details [default] | The keys scan details of default CF when executing the batch_get command |
-Coprocessor | Request duration | The time consumed to handle coprocessor read requests |
-Coprocessor | Wait duration | The time consumed when coprocessor requests are waiting to be handled | It should be less than `10s` (P99.99).
-Coprocessor | Processing duration | The time consumed to handle coprocessor requests |
-Coprocessor | 95% Request duration by store | The time consumed to handle coprocessor read requests per TiKV instance (P95) |
-Coprocessor | 95% Wait duration by store | The time consumed when coprocessor requests are waiting to be handled per TiKV instance (P95)|
-Coprocessor | 95% Handling duration by store | The time consumed to handle coprocessor requests per TiKV instance (P95) |
-Coprocessor | Request errors | The total number of the push down request errors | There should not be lots of errors in a short time.
-Coprocessor | DAG executors | The total number of DAG executors |
-Coprocessor | Scan keys | The number of keys that each request scans |
-Coprocessor | Scan details | The scan details for each CF |
-Coprocessor | Table Scan - Details by CF | The table scan details for each CF |
-Coprocessor | Index Scan - Details by CF | The index scan details for each CF |
-Coprocessor | Table Scan - Perf Statistics | The total number of RocksDB internal operations from PerfContext when executing table scan |
-Coprocessor | Index Scan - Perf Statistics | The total number of RocksDB internal operations from PerfContext when executing index scan |
-GC | MVCC versions | The number of versions for each key |
-GC | MVCC deleted versions | The number of versions deleted by GC for each key |
-GC | GC tasks | The count of GC tasks processed by gc_worker |
-GC | GC tasks Duration | The time consumed when executing GC tasks |
-GC | GC keys (write CF) | The count of keys in write CF affected during GC |
-GC | TiDB GC actions result | The TiDB GC action result on Region level |
-GC | TiDB GC worker actions | The count of TiDB GC worker actions |
-GC | TiDB GC seconds | The GC duration |
-GC | TiDB GC failure | The count of failed TiDB GC jobs |
-GC | GC lifetime | The lifetime of TiDB GC |
-GC | GC interval | The interval of TiDB GC |
-Snapshot | Rate snapshot message | The rate at which Raft snapshot messages are sent |
-Snapshot | 99% Handle snapshot duration | The time consumed to handle snapshots (P99) |
-Snapshot | Snapshot state count | The number of snapshots per state |
-Snapshot | 99.99% Snapshot size | The snapshot size (P99.99) |
-Snapshot | 99.99% Snapshot KV count | The number of KV within a snapshot (P99.99) |
-Task | Worker handled tasks | The number of tasks handled by worker |
-Task | Worker pending tasks | Current number of pending and running tasks of worker | It should be less than `1000`.
-Task | FuturePool handled tasks | The number of tasks handled by future_pool |
-Task | FuturePool pending tasks | Current number of pending and running tasks of future_pool |
-Thread CPU | Raft store CPU | The CPU utilization of the raftstore thread | The CPU usage should be less than `80%`.
-Thread CPU | Async apply CPU | The CPU utilization of async apply | The CPU usage should be less than `90%`.
-Thread CPU | Scheduler CPU | The CPU utilization of scheduler | The CPU usage should be less than `80%`.
-Thread CPU | Scheduler Worker CPU | The CPU utilization of scheduler worker |
-Thread CPU | Storage ReadPool CPU | The CPU utilization of readpool |
-Thread CPU | Coprocessor CPU | The CPU utilization of coprocessor |
-Thread CPU | Snapshot worker CPU | The CPU utilization of snapshot worker |
-Thread CPU | Split check CPU | The CPU utilization of split check |
-Thread CPU | RocksDB CPU | The CPU utilization of RocksDB |
-Thread CPU | gRPC poll CPU | The CPU utilization of gRPC | The CPU usage should be less than `80%`.
-RocksDB - XX | Get operations | The count of get operations |
-RocksDB - XX | Get duration | The time consumed when executing get operations |
-RocksDB - XX | Seek operations | The count of seek operations |
-RocksDB - XX | Seek duration | The time consumed when executing seek operations |
-RocksDB - XX | Write operations | The count of write operations |
-RocksDB - XX | Write duration | The time consumed when executing write operations |
-RocksDB - XX | WAL sync operations | The count of WAL sync operations |
-RocksDB - XX | WAL sync duration | The time consumed when executing WAL sync operations |
-RocksDB - XX | Compaction operations | The count of compaction and flush operations |
-RocksDB - XX | Compaction duration | The time consumed when executing the compaction and flush operations |
-RocksDB - XX | SST read duration | The time consumed when reading SST files |
-RocksDB - XX | Write stall duration | Write stall duration | It should be `0` in normal case.
-RocksDB - XX | Memtable size | The memtable size of each column family |
-RocksDB - XX | Memtable hit | The hit rate of memtable |
-RocksDB - XX | Block cache size | The block cache size. Broken down by column family if shared block cache is disabled. |
-RocksDB - XX | Block cache hit | The hit rate of block cache |
-RocksDB - XX | Block cache flow | The flow rate of block cache operations per type |
-RocksDB - XX | Block cache operations | The count of block cache operations per type |
-RocksDB - XX | Keys flow | The flow rate of operations on keys per type |
-RocksDB - XX | Total keys | The count of keys in each column family |
-RocksDB - XX | Read flow | The flow rate of read operations per type |
-RocksDB - XX | Bytes / Read | The bytes per read operation|
-RocksDB - XX | Write flow | The flow rate of write operations per type|
-RocksDB - XX | Bytes / Write | The bytes per write operation |
-RocksDB - XX | Compaction flow | The flow rate of compaction operations per type |
-RocksDB - XX | Compaction pending bytes | The pending bytes to be compacted |
-RocksDB - XX | Read amplification | The read amplification per TiKV instance |
-RocksDB - XX | Compression ratio | The compression ratio of each level |
-RocksDB - XX | Number of snapshots | The number of snapshots per TiKV instance |
-RocksDB - XX | Oldest snapshots duration | The time that the oldest unreleased snapshot survivals |
-RocksDB - XX | Number files at each level | The number of SST files for different column families in each level |
-RocksDB - XX | Ingest SST duration seconds | The time consumed to ingest SST files |
-RocksDB - XX | Stall conditions changed of each CF | Stall conditions changed of each column family |
-gRPC | gRPC messages | The count of gRPC messages per type |
-gRPC | gRPC message failed | The count of failed gRPC messages per type|
-gRPC | 99% gRPC message duration | The gRPC message duration per message type (P99) |
-gRPC | gRPC GC message count | The count of gRPC GC messages |
-gRPC | 99% gRPC KV GC message duration | The execution time of gRPC GC messages (P99) |
-PD | PD requests | The count of requests that TiKV sends to PD |
-PD | PD request duration (average) | The time consumed by requests that TiKV sends to PD |
-PD | PD heartbeats | The total number of PD heartbeat messages |
-PD | PD validated peers | The total number of peers validated by the PD worker |
-
-## TiKV dashboard interface
-
-This section shows images of the service panels on the TiKV dashboard.
-
-### Cluster
+If you use TiUP to deploy the TiDB cluster, the monitoring system (Prometheus/Grafana) is deployed at the same time. For more information, see [Overview of the Monitoring Framework](/tidb-monitoring-framework.md).
+
+The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node_exporter, and so on. A lot of metrics are there to help you diagnose.
+
+You can get an overview of the component TiKV status from the **TiKV-Details** dashboard, where the key metrics are displayed. According to the [Performance Map](https://asktug.com/_/tidb-performance-map/#/), you can check whether the status of the cluster is as expected.
+
+This document provides a detailed description of these key metrics on the **TiKV-Details** dashboard.
+
+## Cluster
+
+- Store size: The storage size per TiKV instance
+- Available size: The available capacity per TiKV instance
+- Capacity size: The capacity size per TiKV instance
+- CPU: The CPU utilization per TiKV instance
+- Memory: The memory usage per TiKV instance
+- IO utilization: The I/O utilization per TiKV instance
+- MBps: The total bytes of read and write in each TiKV instance
+- QPS: The QPS per command in each TiKV instance
+- Errps: The rate of gRPC message failures
+- leader: The number of leaders per TiKV instance
+- Region: The number of Regions per TiKV instance
+- Uptime: The runtime of TiKV since last restart
 
 ![TiKV Dashboard - Cluster metrics](/media/tikv-dashboard-cluster.png)
 
-### Errors
+## Errors
+
+- Critical error: The number of critical errors
+- Server is busy: Indicates occurrences of events that make the TiKV instance unavailable temporarily, such as Write Stall, Channel Full, and so on. It should be `0` in normal case.
+- Server report failures: The number of error messages reported by server. It should be `0` in normal case.
+- Raftstore error: The number of Raftstore errors per type on each TiKV instance
+- Scheduler error: The number of scheduler errors per type on each TiKV instance
+- Coprocessor error: The number of coprocessor errors per type on each TiKV instance
+- gRPC message error: The number of gRPC message errors per type on each TiKV instance
+- Leader drop: The count of dropped leaders per TiKV instance
+- Leader missing: The count of missing leaders per TiKV instance
 
 ![TiKV Dashboard - Errors metrics](/media/tikv-dashboard-errors.png)
 
-### Server
+## Server
+
+- CF size: The size of each column family
+- Store size: The storage size per TiKV instance
+- Channel full: The number of Channel Full errors per TiKV instance. It should be `0` in normal case.
+- Active written leaders: The number of leaders being written on each TiKV instance
+- Approximate Region size: The approximate Region size
+- Approximate Region size Histogram: The histogram of each approximate Region size
+- Region average written keys: The average number of written keys to Regions per TiKV instance
+- Region average written bytes: The average written bytes to Regions per TiKV instance
 
 ![TiKV Dashboard - Server metrics](/media/tikv-dashboard-server.png)
 
-### Raft IO
+## gRPC
+
+- gRPC message count: The rate of gRPC messages per type
+- gRPC message failed: The rate of failed gRPC messages
+- 99% gRPC message duration: The gRPC message duration per message type (P99)
+- Average gRPC message duration: The average execution time of gRPC messages
+- gRPC batch size: The batch size of gRPC messages between TiDB and TiKV
+- Raft message batch size: The batch size of Raft messages between TiKV instances
+
+## Thread CPU
+
+- Raft store CPU: The CPU utilization of the `raftstore` thread. The CPU utilization should be less than 80% * `raftstore.store-pool-size` in normal case.
+- Async apply CPU: The CPU utilization of the `async apply` thread. The CPU utilization should be less than 90% * `raftstore.apply-pool-size` in normal cases.
+- Scheduler worker CPU: The CPU utilization of the `scheduler worker` thread. The CPU utilization should be less than 90% * `storage.scheduler-worker-pool-size` in normal cases.
+- gRPC poll CPU: The CPU utilization of the `gRPC` thread. The CPU utilization should be less than 80% * `server.grpc-concurrency` in normal cases.
+- Unified read pool CPU: The CPU utilization of the `unified read pool` thread
+- Storage ReadPool CPU: The CPU utilization of the `storage read pool` thread
+- Coprocessor CPU: The CPU utilization of the `coprocessor` thread
+- RocksDB CPU: The CPU utilization of the RocksDB thread
+- Split check CPU: The CPU utilization of the `split check` thread
+- GC worker CPU: The CPU utilization of the `GC worker` thread
+- Snapshot worker CPU: The CPU utilization of the `snapshot worker` thread
+
+## PD
+
+- PD requests: The rate at which TiKV sends to PD
+- PD request duration (average): The average duration of processing requests that TiKV sends to PD
+- PD heartbeats: The rate at which heartbeat messages are sent from TiKV to PD
+- PD validate peers: The rate at which messages are sent from TiKV to PD to validate TiKV peers
+
+## Raft IO
+
+- Apply log duration: The time consumed for Raft to apply logs
+- Apply log duration per server: The time consumed for Raft to apply logs per TiKV instance
+- Append log duration: The time consumed for Raft to append logs
+- Append log duration per server: The time consumed for Raft to append logs per TiKV instance
+- Commit log duration: The time consumed by Raft to commit logs
+- Commit log duration per server: The time consumed by Raft to commit logs per TiKV instance
 
 ![TiKV Dashboard - Raft IO metrics](/media/tikv-dashboard-raftio.png)
 
-### Raft process
+## Raft process
+
+- Ready handled: The count of handled ready operations per second
+- 0.99 Duration of Raft store events: The time consumed by Raftstore events (P99)
+- Process ready duration: The time consumed for processes to be ready in Raft
+- Process ready duration per server: The time consumed for peer processes to be ready in Raft per TiKV instance. It should be less than 2 seconds (P99.99).
 
 ![TiKV Dashboard - Raft process metrics](/media/tikv-dashboard-raft-process.png)
 
-### Raft message
+## Raft message
+
+- Sent messages per server: The number of Raft messages sent by each TiKV instance per second
+- Flush messages per server: The number of Raft messages flushed by the Raft client in each TiKV instance per second
+- Receive messages per server: The number of Raft messages received by each TiKV instance per second
+- Messages: The number of Raft messages sent per type per second
+- Vote: The number of Vote messages sent in Raft per second
+- Raft dropped messages: The number of dropped Raft messages per type per second
 
 ![TiKV Dashboard - Raft message metrics](/media/tikv-dashboard-raft-message.png)
 
-### Raft proposal
+## Raft propose
 
-![TiKV Dashboard - Raft proposal metrics](/media/tikv-dashboard-raft-propose.png)
+- Raft apply proposals per ready: The histogram of the number of proposals that each ready operation contains in a batch while applying proposal.
+- Raft read/write proposals: The number of proposals per type per second
+- Raft read proposals per server: The number of read proposals made by each TiKV instance per second
+- Raft write proposals per server: The number of write proposals made by each TiKV instance per second
+- Propose wait duration: The histogram of waiting time of each proposal
+- Propose wait duration per server: The histogram of waiting time of each proposal per TiKV instance
+- Apply wait duration: The histogram of apply time of each proposal
+- Apply wait duration per server: The histogram of apply time of each proposal per TiKV instance
+- Raft log speed: The average rate at which peers propose logs
 
-### Raft admin
+![TiKV Dashboard - Raft propose metrics](/media/tikv-dashboard-raft-propose.png)
+
+## Raft admin
+
+- Admin proposals: The number of admin proposals per second
+- Admin apply: The number of processed apply commands per second
+- Check split: The number of Raftstore split check commands per second
+- 99.99% Check split duration: The time consumed when running split check commands (P99.99)
 
 ![TiKV Dashboard - Raft admin metrics](/media/tikv-dashboard-raft-admin.png)
 
-### Local reader
+## Local reader
+
+- Local reader requests: The number of total requests and the number of rejections from the local read thread
 
 ![TiKV Dashboard - Local reader metrics](/media/tikv-dashboard-local-reader.png)
 
-### Storage
+## Unified Read Pool
 
-![TiKV Dashboard - Storage metrics](/media/tikv-dashboard-storage.png)
+- Time used by level: The time consumed for each level in the unified read pool. Level 0 means small queries.
+- Level 0 chance: The proportion of level 0 tasks in unified read pool
+- Running tasks: The number of tasks running concurrently in the unified read pool
 
-### Scheduler
+## Storage
 
-![TiKV Dashboard - Scheduler metrics](/media/tikv-dashboard-scheduler.png)
+- Storage command total: The number of received command by type per second
+- Storage async request error: The number of engine asynchronous request errors per second
+- Storage async snapshot duration: The time consumed by processing asynchronous snapshot requests. It should be less than `1s` in `.99`.
+- Storage async write duration: The time consumed by processing asynchronous write requests. It should be less than `1s` in `.99`.
 
-### Scheduler - batch_get
+![TiKV Dashboard - Storage metrics](/media/tikv-dashboard-storage.png)
+
+## Scheduler
 
-![TiKV Dashboard - Scheduler - batch_get metrics](/media/tikv-dashboard-scheduler-batch-get.png)
+- Scheduler stage total: The number of commands at each stage per second. There should not be a lot of errors in a short time.
+- Scheduler writing bytes: The total written bytes by commands processed on each TiKV instance
+- Scheduler priority commands: The count of different priority commands per second
+- Scheduler pending commands: The count of pending commands per TiKV instance per second
 
-### Scheduler - cleanup
+![TiKV Dashboard - Scheduler metrics](/media/tikv-dashboard-scheduler.png)
 
-![TiKV Dashboard - Scheduler - cleanup metrics](/media/tikv-dashboard-scheduler-cleanup.png)
+## Scheduler - commit
 
-### Scheduler - commit
+- Scheduler stage total: The number of commands at each stage per second when executing the commit command. There should not be a lot of errors in a short time.
+- Scheduler command duration: The time consumed when executing the commit command. It should be less than `1s`.
+- Scheduler latch wait duration: The waiting time caused by latch when executing the commit command. It should be less than `1s`.
+- Scheduler keys read: The count of keys read by a commit command
+- Scheduler keys written: The count of keys written by a commit command
+- Scheduler scan details: The keys scan details of each CF when executing the commit command.
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the commit command
+- Scheduler scan details [write]: The keys scan details of write CF when executing the commit command
+- Scheduler scan details [default]: The keys scan details of default CF when executing the commit command
 
 ![TiKV Dashboard - Scheduler commit metrics](/media/tikv-dashboard-scheduler-commit.png)
+
+## Scheduler - pessimistic_rollback
+
+- Scheduler stage total: The number of commands at each stage per second when executing the `pessimistic_rollback` command. There should not be a lot of errors in a short time.
+- Scheduler command duration: The time consumed when executing the `pessimistic_rollback` command. It should be less than `1s`.
+- Scheduler latch wait duration: The waiting time caused by latch when executing the `pessimistic_rollback` command. It should be less than `1s`.
+- Scheduler keys read: The count of keys read by a `pessimistic_rollback` command
+- Scheduler keys written: The count of keys written by a `pessimistic_rollback` command
+- Scheduler scan details: The keys scan details of each CF when executing the `pessimistic_rollback` command.
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the `pessimistic_rollback` command
+- Scheduler scan details [write]: The keys scan details of write CF when executing the `pessimistic_rollback` command
+- Scheduler scan details [default]: The keys scan details of default CF when executing the `pessimistic_rollback` command
+
+## Scheduler - prewrite
+
+- Scheduler stage total: The number of commands at each stage per second when executing the prewrite command. There should not be a lot of errors in a short time.
+- Scheduler command duration: The time consumed when executing the prewrite command. It should be less than `1s`.
+- Scheduler latch wait duration: The waiting time caused by latch when executing the prewrite command. It should be less than `1s`.
+- Scheduler keys read: The count of keys read by a prewrite command
+- Scheduler keys written: The count of keys written by a prewrite command
+- Scheduler scan details: The keys scan details of each CF when executing the prewrite command.
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the prewrite command
+- Scheduler scan details [write]: The keys scan details of write CF when executing the prewrite command
+- Scheduler scan details [default]: The keys scan details of default CF when executing the prewrite command
+
+## Scheduler - rollback
+
+- Scheduler stage total: The number of commands at each stage per second when executing the rollback command. There should not be a lot of errors in a short time.
+- Scheduler command duration: The time consumed when executing the rollback command. It should be less than `1s`.
+- Scheduler latch wait duration: The waiting time caused by latch when executing the rollback command. It should be less than `1s`.
+- Scheduler keys read: The count of keys read by a rollback command
+- Scheduler keys written: The count of keys written by a rollback command
+- Scheduler scan details: The keys scan details of each CF when executing the rollback command.
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the rollback command
+- Scheduler scan details [write]: The keys scan details of write CF when executing the rollback command
+- Scheduler scan details [default]: The keys scan details of default CF when executing the rollback command
+
+## GC
+
+- MVCC versions: The number of versions for each key
+- MVCC delete versions: The number of versions deleted by GC for each key
+- GC tasks: The count of GC tasks processed by gc_worker
+- GC tasks Duration: The time consumed when executing GC tasks
+- GC keys (write CF): The count of keys in write CF affected during GC
+- TiDB GC worker actions: The count of TiDB GC worker actions
+- TiDB GC seconds: The GC duration
+- GC speed: The number of keys deleted by GC per second
+- TiKV AutoGC Working: The status of Auto GC
+- ResolveLocks Progress: The progress of the first phase of GC (Resolve Locks)
+- TiKV Auto GC Progress: The progress of the second phase of GC
+- TiKV Auto GC SafePoint: The value of TiKV GC safe point. The safe point is the current GC timestamp
+- GC lifetime: The lifetime of TiDB GC
+- GC interval: The interval of TiDB GC
+
+## Snapshot
+
+- Rate snapshot message: The rate at which Raft snapshot messages are sent
+- 99% Handle snapshot duration: The time consumed to handle snapshots (P99)
+- Snapshot state count: The number of snapshots per state
+- 99.99% Snapshot size: The snapshot size (P99.99)
+- 99.99% Snapshot KV count: The number of KV within a snapshot (P99.99)
+
+## Task
+
+- Worker handled tasks: The number of tasks handled by worker per second
+- Worker pending tasks: Current number of pending and running tasks of worker per second. It should be less than `1000` in normal case.
+- FuturePool handled tasks: The number of tasks handled by future pool per second
+- FuturePool pending tasks: Current number of pending and running tasks of future pool per second
+
+## Coprocessor Overview
+
+- Request duration: The total duration from the time of receiving the coprocessor request to the time of finishing processing the request
+- Total Requests: The number of requests by type per second
+- Handle duration: The histogram of time spent actually processing coprocessor requests per minute
+- Total Request Errors: The number of request errors of Coprocessor per second. There should not be a lot of errors in a short time.
+- Total KV Cursor Operations: The total number of the KV cursor operations by type per second, such as `select`, `index`, `analyze_table`, `analyze_index`, `checksum_table`, `checksum_index`, and so on.
+- KV Cursor Operations: The histogram of KV cursor operations by type per second
+- Total RocksDB Perf Statistics: The statistics of RocksDB performance
+- Total Response Size: The total size of coprocessor response
+
+## Coprocessor Detail
+
+- Handle duration: The histogram of time spent actually processing coprocessor requests per minute
+- 95% Handle duration by store: The time consumed to handle coprocessor requests per TiKV instance per second (P95)
+- Wait duration: The time consumed when coprocessor requests are waiting to be handled. It should be less than `10s` (P99.99).
+- 95% Wait duration by store: The time consumed when coprocessor requests are waiting to be handled per TiKV instance per second (P95)
+- Total DAG Requests: The total number of DAG requests per second
+- Total DAG Executors: The total number of DAG executors per second
+- Total Ops Details (Table Scan): The number of RocksDB internal operations per second when executing select scan in coprocessor
+- Total Ops Details (Index Scan): The number of RocksDB internal operations per second when executing index scan in coprocessor
+- Total Ops Details by CF (Table Scan): The number of RocksDB internal operations for each CF per second when executing select scan in coprocessor
+- Total Ops Details by CF (Index Scan): The number of RocksDB internal operations for each CF per second when executing index scan in coprocessor
+
+## Threads
+
+- Threads state: The state of TiKV threads
+- Threads IO: The I/O traffic of each TiKV thread
+- Thread Voluntary Context Switches: The number of voluntary context switches of TiKV threads
+- Thread Nonvoluntary Context Switches: The number of nonvoluntary context switches of TiKV threads
+
+## RocksDB - kv/raft
+
+- Get operations: The count of get operations per second
+- Get duration: The time consumed when executing get operations
+- Seek operations: The count of seek operations per second
+- Seek duration: The time consumed when executing seek operations
+- Write operations: The count of write operations per second
+- Write duration: The time consumed when executing write operations
+- WAL sync operations: The count of WAL sync operations per second
+- Write WAL duration: The time consumed for writing WAL
+- WAL sync duration: The time consumed when executing WAL sync operations
+- Compaction operations: The count of compaction and flush operations per second
+- Compaction duration: The time consumed when executing the compaction and flush operations
+- SST read duration: The time consumed when reading SST files
+- Write stall duration: The duration of write stalls. In normal cases, it should be `0`.
+- Memtable size: The memtable size of each column family
+- Memtable hit: The hit rate of memtable
+- Block cache size: The block cache size. Broken down by column family if shared block cache is disabled.
+- Block cache hit: The hit rate of block cache
+- Block cache flow: The flow rate of block cache operations per type
+- Block cache operations: The count of block cache operations per type
+- Keys flow: The flow rate of operations on keys per type
+- Total keys: The count of keys in each column family
+- Read flow: The flow rate of read operations per type
+- Bytes / Read: The bytes per read operation
+- Write flow: The flow rate of write operations per type
+- Bytes / Write: The bytes per write operation
+- Compaction flow: The flow rate of compaction operations per type
+- Compaction pending bytes: The pending bytes to be compacted
+- Read amplification: The read amplification per TiKV instance
+- Compression ratio: The compression ratio of each level
+- Number of snapshots: The number of snapshots per TiKV instance
+- Oldest snapshots duration: The time for which the oldest unreleased snapshot has survived
+- Number files at each level: The number of SST files for different column families in each level
+- Ingest SST duration seconds: The time consumed to ingest SST files
+- Stall conditions changed of each CF: The changes in stall conditions of each column family
+
+## Titan - All
+
+- Blob file count: The number of Titan blob files
+- Blob file size: The total size of Titan blob files
+- Live blob size: The total size of valid blob records
+- Blob cache hit: The hit rate of Titan block cache
+- Iter touched blob file count: The number of blob files involved in a single iterator
+- Blob file discardable ratio distribution: The ratio distribution of blob record failure of blob files
+- Blob key size: The size of Titan blob keys
+- Blob value size: The size of Titan blob values
+- Blob get operations: The count of get operations in Titan blob
+- Blob get duration: The time consumed when executing get operations in Titan blob
+- Blob iter operations: The count of iter operations in Titan blob
+- Blob seek duration: The time consumed when executing seek operations in Titan blob
+- Blob next duration: The time consumed when executing next operations in Titan blob
+- Blob prev duration: The time consumed when executing prev operations in Titan blob
+- Blob keys flow: The flow rate of operations on Titan blob keys
+- Blob bytes flow: The flow rate of bytes on Titan blob keys
+- Blob file read duration: The time consumed when reading Titan blob file
+- Blob file write duration: The time consumed when writing Titan blob file
+- Blob file sync operations: The count of blob file sync operations
+- Blob file sync duration: The time consumed when synchronizing blob file
+- Blob GC action: The count of Titan GC actions
+- Blob GC duration: The Titan GC duration
+- Blob GC keys flow: The flow rate of keys read and written by Titan GC
+- Blob GC bytes flow: The flow rate of bytes read and written by Titan GC
+- Blob GC input file size: The size of Titan GC input file
+- Blob GC output file size: The size of Titan GC output file
+- Blob GC file count: The count of blob files involved in Titan GC
+
+## Lock manager
+
+- Thread CPU: The CPU utilization of the lock manager thread
+- Handled tasks: The number of tasks handled by lock manager
+- Waiter lifetime duration: The waiting time of the transaction for the lock to be released
+- Wait table: The status information of wait table, including the number of locks and the number of transactions waiting for the lock
+- Deadlock detect duration: The time consumed for detecting deadlock
+- Detect error: The number of errors encountered when detecting deadlock, including the number of deadlocks
+- Deadlock detector leader: The information of the node where the deadlock detector leader is located
+
+## Memory
+
+- Allocator Stats: The statistics of the memory allocator
+
+## Backup
+
+- Backup CPU: The CPU utilization of the backup thread
+- Range Size: The histogram of backup range size
+- Backup Duration: The time consumed for backup
+- Backup Flow: The total bytes of backup
+- Disk Throughput: The disk throughput per instance
+- Backup Range Duration: The time consumed for backing up a range
+- Backup Errors: The number of errors encountered during a backup
+
+## Encryption
+
+- Encryption data keys: The total number of encrypted data keys
+- Encrypted files: The number of encrypted files
+- Encryption initialized: Shows whether encryption is enabled. `1` means enabled.
+- Encryption meta files size: The size of the encryption meta file
+- Encrypt/decrypt data nanos: The histogram of the duration of each data encryption/decryption operation
+- Read/write encryption meta duration: The time consumed for reading/writing encryption meta files
+
+## Explanation of Common Parameters
+
+### gRPC Message Type
+
+1. Transactional API:
+
+    - kv_get: The command of getting the latest version of data specified by `ts`
+    - kv_scan: The command of scanning a range of data
+    - kv_prewrite: The command of prewriting the data to be committed in the first phase of 2PC
+    - kv_pessimistic_lock: The command of adding a pessimistic lock to the key to prevent other transactions from modifying this key
+    - kv_pessimistic_rollback: The command of deleting the pessimistic lock on the key
+    - kv_txn_heart_beat: The command of updating `lock_ttl` for pessimistic transactions or large transactions to prevent them from rolling back
+    - kv_check_txn_status: The command of checking the status of the transaction
+    - kv_commit: The command of committing the data written by the prewrite command
+    - kv_cleanup: The command of rolling back a transaction, which is deprecated in v4.0
+    - kv_batch_get: The command of getting the values of a batch of keys at once, similar to `kv_get`
+    - kv_batch_rollback: The command of batch rollback of multiple prewrite transactions
+    - kv_scan_lock: The command of scanning all locks with a version number before `max_version` to clean up expired transactions
+    - kv_resolve_lock: The command of committing or rolling back the transaction lock, according to the transaction status
+    - kv_gc: The command of GC
+    - kv_delete_range: The command of deleting a range of data from TiKV
+
+2. Raw API:
+
+    - raw_get: The command of getting the value of a key
+    - raw_batch_get: The command of getting the values of a batch of keys
+    - raw_scan: The command of scanning a range of data
+    - raw_batch_scan: The command of scanning multiple consecutive data ranges
+    - raw_put: The command of writing a key/value pair
+    - raw_batch_put: The command of writing a batch of key/value pairs
+    - raw_delete: The command of deleting a key/value pair
+    - raw_batch_delete: The command of deleting a batch of key/value pairs
+    - raw_delete_range: The command of deleting a range of data
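
When reading a per-type panel, it can help to fold the gRPC message types listed above into their two API families. The following is a minimal sketch in Python under stated assumptions: the message type names are copied from the lists above, while the helper names (`api_family`, `group_qps_by_family`) and the sample figures are hypothetical and are not part of TiKV, Prometheus, or Grafana.

```python
from typing import Dict

# Message type names copied from the "gRPC Message Type" lists above.
TRANSACTIONAL_API = frozenset({
    "kv_get", "kv_scan", "kv_prewrite", "kv_pessimistic_lock",
    "kv_pessimistic_rollback", "kv_txn_heart_beat", "kv_check_txn_status",
    "kv_commit", "kv_cleanup", "kv_batch_get", "kv_batch_rollback",
    "kv_scan_lock", "kv_resolve_lock", "kv_gc", "kv_delete_range",
})
RAW_API = frozenset({
    "raw_get", "raw_batch_get", "raw_scan", "raw_batch_scan", "raw_put",
    "raw_batch_put", "raw_delete", "raw_batch_delete", "raw_delete_range",
})


def api_family(message_type: str) -> str:
    """Return "transactional", "raw", or "unknown" for a message type name."""
    if message_type in TRANSACTIONAL_API:
        return "transactional"
    if message_type in RAW_API:
        return "raw"
    return "unknown"


def group_qps_by_family(qps_by_type: Dict[str, float]) -> Dict[str, float]:
    """Sum per-type rates into one total per API family (hypothetical helper)."""
    totals = {"transactional": 0.0, "raw": 0.0, "unknown": 0.0}
    for message_type, qps in qps_by_type.items():
        totals[api_family(message_type)] += qps
    return totals


if __name__ == "__main__":
    # Hypothetical sample figures, not taken from a real cluster.
    sample = {"kv_get": 100.0, "kv_commit": 50.0, "raw_put": 25.0}
    print(group_qps_by_family(sample))
```

Types that appear in neither list fall into `unknown` rather than being guessed, so the two family totals only ever include the commands documented above.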