From 38caf95dc02cf4dcbd8735126a3041acbb60fb3d Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Fri, 26 Jun 2020 23:30:12 +0800
Subject: [PATCH 01/13] Update grafana tikv dashboard doc

---
 grafana-tikv-dashboard.md | 575 +++++++++++++++++++++++++-------------
 1 file changed, 378 insertions(+), 197 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index dff06b60d0c36..f98774a9e5aa8 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -5,236 +5,417 @@ category: reference
 aliases: ['/docs/dev/grafana-tikv-dashboard/','/docs/dev/reference/key-monitoring-metrics/tikv-dashboard/']
 ---
 
-# Key Monitoring Metrics of TiKV
-
-If you use TiDB Ansible to deploy the TiDB cluster, the monitoring system is deployed at the same time. For more information, see [Overview of the Monitoring Framework](/tidb-monitoring-framework.md).
-
-The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, Disk Performance, and so on. A lot of metrics are there to help you diagnose.
-
-You can get an overview of the component TiKV status from the TiKV dashboard, where the key metrics are displayed. This document provides a detailed description of these key metrics.
-
-## Key metrics description
-
-To understand the key metrics displayed on the Overview dashboard, check the following table:
-
-Service | Panel name | Description | Normal range
----------------- | ---------------- | ---------------------------------- | --------------
-Cluster | Store size | The storage size per TiKV instance |
-Cluster | Available size | The available capacity per TiKV instance |
-Cluster | Capacity size | The capacity size per TiKV instance |
-Cluster | CPU | The CPU usage per TiKV instance |
-Cluster | Memory | The memory usage per TiKV instance |
-Cluster | IO utilization | The I/O utilization per TiKV instance |
-Cluster | MBps | The total bytes of read and write in each TiKV instance |
-Cluster | QPS | The QPS per command in each TiKV instance |
-Cluster | Errors-gRPC | The total number of gRPC message failures |
-Cluster | Leaders | The number of leaders per TiKV instance |
-Cluster | Regions | The number of Regions per TiKV instance |
-Errors | Server is busy | Indicates occurrences of events that make the TiKV instance unavailable temporarily, such as Write Stall, Channel Full, Scheduler Busy, and Coprocessor Full|
-Errors | Server message failures | The number of failed messages between TiKV instances | It should be `0` in normal case.
-Errors | Raftstore errors | The number of Raftstore errors per type on each TiKV instance |
-Errors | Scheduler errors | The number of scheduler errors per type on each TiKV instance |
-Errors | Coprocessor errors | The number of coprocessor errors per type on each TiKV instance |
-Errors | gRPC message errors | The number of gRPC message errors per type on each TiKV instance |
-Errors | Leader drop | The count of dropped leaders per TiKV instance |
-Errors | Leader missing | The count of missing leaders per TiKV instance |
-Server | Leaders | The number of leaders per TiKV instance |
-Server | Regions | The number of Regions per TiKV instance |
-Server | CF size | The size of each column family |
-Server | Store size | The storage size per TiKV instance |
-Server | Channel full | The number of Channel Full errors per TiKV instance | It should be `0` in normal case.
-Server | Server message failures | The number of failed messages between TiKV instances |
-Server | Average Region written keys | The average rate of written keys to Regions per TiKV instance |
-Server | Average Region written bytes | The average rate of writing bytes to Regions per TiKV instance |
-Server | Active written leaders | The number of leaders being written on each TiKV instance |
-Server | Approximate Region size | The approximate Region size |
-Raft IO | Apply log duration | The time consumed for Raft to apply logs |
-Raft IO | Apply log duration per server | The time consumed for Raft to apply logs per TiKV instance |
-Raft IO | Append log duration | The time consumed for Raft to append logs |
-Raft IO | Append log duration per server | The time consumed for Raft to append logs per TiKV instance |
-Raft process | Ready handled | The count of handled ready buckets per region |
-Raft process | Process ready duration per server | The time consumed for peer processes to be ready in Raft | It should be less than `2s` (P99.99).
-Raft process | Process tick duration per server | The peer processes in Raft |
-Raft process | 99% Duration of raftstore events | The time consumed by raftstore events (P99) |
-Raft message | Sent messages per server | The number of Raft messages sent by each TiKV instance |
-Raft message | Flush messages per server | The number of Raft messages flushed by each TiKV instance |
-Raft message | Receive messages per server | The number of Raft messages received by each TiKV instance |
-Raft message | Messages | The number of Raft messages sent per type |
-Raft message | Vote | The number of Vote messages sent in Raft |
-Raft message | Raft dropped messages | The number of dropped Raft messages per type|
-Raft proposal | Raft proposals per ready | The number of Raft proposals of all Regions per ready handled bucket|
-Raft proposal | Raft read/write proposals | The number of proposals per type|
-Raft proposal | Raft read proposals per server | The number of read proposals made by each TiKV instance |
-Raft proposal | Raft write proposals per server | The number of write proposals made by each TiKV instance |
-Raft proposal | Proposal wait duration | The wait time of each proposal |
-Raft proposal | Proposal wait duration per server | The wait time of each proposal per TiKV instance |
-Raft proposal | Raft log speed | The rate at which peers propose logs |
-Raft admin | Admin proposals | The number of admin proposals |
-Raft admin | Admin apply | The number of processed apply commands |
-Raft admin | Check split | The number of raftstore split checks |
-Raft admin | 99.99% Check split duration | The time consumed when running split checks (P99.99) |
-Local reader | Local reader requests | The number of total requests and the number of rejections from the local read thread |
-Local reader | Local read requests duration | The wait time of local read requests |
-Local reader | Local read requests batch size | The batch size of local read requests |
-Storage | Storage command total | The total number of received commands per type |
-Storage | Storage async request error | The total number of engine asynchronous request errors |
-Storage | Storage async snapshot duration | The time consumed by processing asynchronous snapshot requests | It should be less than `1s` in `.99`.
-Storage | Storage async write duration | The time consumed by processing asynchronous write requests | It should be less than `1s` in `.99`.
-Scheduler | Scheduler stage total | The total number of commands at each stage | There should not be lots of errors in a short time.
-Scheduler | Scheduler priority commands | The count of different priority commands |
-Scheduler | Scheduler pending commands | The count of pending commands per TiKV instance |
-Scheduler - XX | Scheduler stage total | The total number of commands at each stage when executing the batch_get command | There should not be lots of errors in a short time.
-Scheduler - XX | Scheduler command duration | The time consumed when executing the batch_get command | It should be less than `1s`.
-Scheduler - XX | Scheduler latch wait duration | The wait time caused by latch when executing the batch_get command | It should be less than `1s`.
-Scheduler - XX | Scheduler keys read | The count of keys read by a batch_get command |
-Scheduler - XX | Scheduler keys written | The count of keys written by a batch_get command |
-Scheduler - XX | Scheduler scan details | The keys scan details of each CF when executing the batch_get command |
-Scheduler - XX | Scheduler scan details [lock] | The keys scan details of lock CF when executing the batch_get command |
-Scheduler - XX | Scheduler scan details [write] | The keys scan details of write CF when executing the batch_get command |
-Scheduler - XX | Scheduler scan details [default] | The keys scan details of default CF when executing the batch_get command |
-Coprocessor | Request duration | The time consumed to handle coprocessor read requests |
-Coprocessor | Wait duration | The time consumed when coprocessor requests are waiting to be handled | It should be less than `10s` (P99.99).
-Coprocessor | Processing duration | The time consumed to handle coprocessor requests |
-Coprocessor | 95% Request duration by store | The time consumed to handle coprocessor read requests per TiKV instance (P95) |
-Coprocessor | 95% Wait duration by store | The time consumed when coprocessor requests are waiting to be handled per TiKV instance (P95)|
-Coprocessor | 95% Handling duration by store | The time consumed to handle coprocessor requests per TiKV instance (P95) |
-Coprocessor | Request errors | The total number of the push down request errors | There should not be lots of errors in a short time.
-Coprocessor | DAG executors | The total number of DAG executors |
-Coprocessor | Scan keys | The number of keys that each request scans |
-Coprocessor | Scan details | The scan details for each CF |
-Coprocessor | Table Scan - Details by CF | The table scan details for each CF |
-Coprocessor | Index Scan - Details by CF | The index scan details for each CF |
-Coprocessor | Table Scan - Perf Statistics | The total number of RocksDB internal operations from PerfContext when executing table scan |
-Coprocessor | Index Scan - Perf Statistics | The total number of RocksDB internal operations from PerfContext when executing index scan |
-GC | MVCC versions | The number of versions for each key |
-GC | MVCC deleted versions | The number of versions deleted by GC for each key |
-GC | GC tasks | The count of GC tasks processed by gc_worker |
-GC | GC tasks Duration | The time consumed when executing GC tasks |
-GC | GC keys (write CF) | The count of keys in write CF affected during GC |
-GC | TiDB GC actions result | The TiDB GC action result on Region level |
-GC | TiDB GC worker actions | The count of TiDB GC worker actions |
-GC | TiDB GC seconds | The GC duration |
-GC | TiDB GC failure | The count of failed TiDB GC jobs |
-GC | GC lifetime | The lifetime of TiDB GC |
-GC | GC interval | The interval of TiDB GC |
-Snapshot | Rate snapshot message | The rate at which Raft snapshot messages are sent |
-Snapshot | 99% Handle snapshot duration | The time consumed to handle snapshots (P99) |
-Snapshot | Snapshot state count | The number of snapshots per state |
-Snapshot | 99.99% Snapshot size | The snapshot size (P99.99) |
-Snapshot | 99.99% Snapshot KV count | The number of KV within a snapshot (P99.99) |
-Task | Worker handled tasks | The number of tasks handled by worker |
-Task | Worker pending tasks | Current number of pending and running tasks of worker | It should be less than `1000`.
-Task | FuturePool handled tasks | The number of tasks handled by future_pool |
-Task | FuturePool pending tasks | Current number of pending and running tasks of future_pool |
-Thread CPU | Raft store CPU | The CPU utilization of the raftstore thread | The CPU usage should be less than `80%`.
-Thread CPU | Async apply CPU | The CPU utilization of async apply | The CPU usage should be less than `90%`.
-Thread CPU | Scheduler CPU | The CPU utilization of scheduler | The CPU usage should be less than `80%`.
-Thread CPU | Scheduler Worker CPU | The CPU utilization of scheduler worker |
-Thread CPU | Storage ReadPool CPU | The CPU utilization of readpool |
-Thread CPU | Coprocessor CPU | The CPU utilization of coprocessor |
-Thread CPU | Snapshot worker CPU | The CPU utilization of snapshot worker |
-Thread CPU | Split check CPU | The CPU utilization of split check |
-Thread CPU | RocksDB CPU | The CPU utilization of RocksDB |
-Thread CPU | gRPC poll CPU | The CPU utilization of gRPC | The CPU usage should be less than `80%`.
-RocksDB - XX | Get operations | The count of get operations |
-RocksDB - XX | Get duration | The time consumed when executing get operations |
-RocksDB - XX | Seek operations | The count of seek operations |
-RocksDB - XX | Seek duration | The time consumed when executing seek operations |
-RocksDB - XX | Write operations | The count of write operations |
-RocksDB - XX | Write duration | The time consumed when executing write operations |
-RocksDB - XX | WAL sync operations | The count of WAL sync operations |
-RocksDB - XX | WAL sync duration | The time consumed when executing WAL sync operations |
-RocksDB - XX | Compaction operations | The count of compaction and flush operations |
-RocksDB - XX | Compaction duration | The time consumed when executing the compaction and flush operations |
-RocksDB - XX | SST read duration | The time consumed when reading SST files |
-RocksDB - XX | Write stall duration | Write stall duration | It should be `0` in normal case.
-RocksDB - XX | Memtable size | The memtable size of each column family |
-RocksDB - XX | Memtable hit | The hit rate of memtable |
-RocksDB - XX | Block cache size | The block cache size. Broken down by column family if shared block cache is disabled. |
-RocksDB - XX | Block cache hit | The hit rate of block cache |
-RocksDB - XX | Block cache flow | The flow rate of block cache operations per type |
-RocksDB - XX | Block cache operations | The count of block cache operations per type |
-RocksDB - XX | Keys flow | The flow rate of operations on keys per type |
-RocksDB - XX | Total keys | The count of keys in each column family |
-RocksDB - XX | Read flow | The flow rate of read operations per type |
-RocksDB - XX | Bytes / Read | The bytes per read operation|
-RocksDB - XX | Write flow | The flow rate of write operations per type|
-RocksDB - XX | Bytes / Write | The bytes per write operation |
-RocksDB - XX | Compaction flow | The flow rate of compaction operations per type |
-RocksDB - XX | Compaction pending bytes | The pending bytes to be compacted |
-RocksDB - XX | Read amplification | The read amplification per TiKV instance |
-RocksDB - XX | Compression ratio | The compression ratio of each level |
-RocksDB - XX | Number of snapshots | The number of snapshots per TiKV instance |
-RocksDB - XX | Oldest snapshots duration | The time that the oldest unreleased snapshot survivals |
-RocksDB - XX | Number files at each level | The number of SST files for different column families in each level |
-RocksDB - XX | Ingest SST duration seconds | The time consumed to ingest SST files |
-RocksDB - XX | Stall conditions changed of each CF | Stall conditions changed of each column family |
-gRPC | gRPC messages | The count of gRPC messages per type |
-gRPC | gRPC message failed | The count of failed gRPC messages per type|
-gRPC | 99% gRPC message duration | The gRPC message duration per message type (P99) |
-gRPC | gRPC GC message count | The count of gRPC GC messages |
-gRPC | 99% gRPC KV GC message duration | The execution time of gRPC GC messages (P99) |
-PD | PD requests | The count of requests that TiKV sends to PD |
-PD | PD request duration (average) | The time consumed by requests that TiKV sends to PD |
-PD | PD heartbeats | The total number of PD heartbeat messages |
-PD | PD validated peers | The total number of peers validated by the PD worker |
-
-## TiKV dashboard interface
-
-This section shows images of the service panels on the TiKV dashboard.
-
-### Cluster
+# The metrics description of TiKV
+
+If you use TiUP to deploy the TiDB cluster, the monitoring system (Prometheus/Grafana) is deployed at the same time. For more information, see [Overview of the Monitoring Framework](/tidb-monitoring-framework.md).
+
+The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, and so on. A lot of metrics are there to help you diagnose.
+
+You can get an overview of the component TiKV status from the TiKV dashboard, where the key metrics are displayed. According to the [Performance Map](https://asktug.com/_/tidb-performance-map/#/), you can check whether the status of the cluster is as expected.
+
+This document provides a detailed description of these key metrics.
+
+## Cluster
+
+- Store size: The storage size per TiKV instance
+- Available size: The available capacity per TiKV instance
+- Capacity size: The capacity size per TiKV instance
+- CPU: The CPU usage per TiKV instance
+- Memory: The memory usage per TiKV instance
+- IO utilization: The I/O utilization per TiKV instance
+- MBps: The total bytes of read and write in each TiKV instance
+- QPS: The QPS per command in each TiKV instance
+- Errps: The total number of gRPC message failures
+- leader: The number of leaders per TiKV instance
+- Region: The number of Regions per TiKV instance
+- Uptime: The runtime of TiKV since the last restart
 
 ![TiKV Dashboard - Cluster metrics](/media/tikv-dashboard-cluster.png)
 
-### Errors
+## Errors
+
+- Critical error: The number of critical errors
+- Server is busy: Indicates occurrences of events that make the TiKV instance unavailable temporarily, such as Write Stall, Channel Full, and so on. It should be `0` in normal cases.
+- Server report failures: The number of error messages reported by the server. It should be `0` in normal cases.
+- Raftstore error: The number of Raftstore errors per type on each TiKV instance
+- Scheduler error: The number of scheduler errors per type on each TiKV instance
+- Coprocessor error: The number of coprocessor errors per type on each TiKV instance
+- gRPC message error: The number of gRPC message errors per type on each TiKV instance
+- Leader drop: The count of dropped leaders per TiKV instance
+- Leader missing: The count of missing leaders per TiKV instance
 
 ![TiKV Dashboard - Errors metrics](/media/tikv-dashboard-errors.png)
 
-### Server
+## Server
+
+- CF size: The size of each column family
+- Store size: The storage size per TiKV instance
+- Channel full: The number of Channel Full errors per TiKV instance. It should be `0` in normal cases.
+- Active written leaders: The number of leaders being written on each TiKV instance
+- Approximate Region size: The approximate Region size
+- Approximate Region size Histogram: The histogram of approximate Region size
+- Region average written keys: The average rate of written keys to Regions per TiKV instance
+- Region average written bytes: The average rate of writing bytes to Regions per TiKV instance
 
 ![TiKV Dashboard - Server metrics](/media/tikv-dashboard-server.png)
 
-### Raft IO
+## gRPC
+
+- gRPC message count: The number of gRPC messages
+- gRPC message failed: The number of failed gRPC messages
+- 99% gRPC message duration: The duration of gRPC messages (P99)
+- Average gRPC message duration: The average duration of gRPC messages
+- gRPC batch size: The batch size of gRPC messages between TiDB and TiKV
+- raft message batch size: The batch size of Raft messages
+
+## Thread CPU
+
+- Raft store CPU: The CPU utilization of the raftstore thread. The CPU usage should be less than 80% * `raftstore.store-pool-size` in normal cases.
+- Async apply CPU: The CPU utilization of async apply. The CPU usage should be less than 80% * `raftstore.apply-pool-size` in normal cases.
+- Scheduler worker CPU: The CPU utilization of the scheduler worker. The CPU usage should be less than 90% * `storage.scheduler-worker-pool-size` in normal cases.
+- gRPC poll CPU: The CPU utilization of gRPC. The CPU usage should be less than 80% * `server.grpc-concurrency` in normal cases.
+- Unified read pool CPU: The CPU utilization of the unified read pool
+- Storage ReadPool CPU: The CPU utilization of readpool
+- Coprocessor CPU: The CPU utilization of coprocessor
+- RocksDB CPU: The CPU utilization of RocksDB
+- Split check CPU: The CPU utilization of split check
+- GC worker CPU: The CPU utilization of GC worker
+- Snapshot worker CPU: The CPU utilization of snapshot worker
+
+## PD
+
+- PD requests: The count of requests that TiKV sends to PD
+- PD request duration (average): The time consumed by requests that TiKV sends to PD
+- PD heartbeats: The total number of PD heartbeat messages
+- PD validate peers: The total number of peers validated by the PD worker
+
+## Raft IO
+
+- Apply log duration: The time consumed for Raft to apply logs
+- Apply log duration per server: The time consumed for Raft to apply logs per TiKV instance
+- Append log duration: The time consumed for Raft to append logs
+- Append log duration per server: The time consumed for Raft to append logs per TiKV instance
+- Commit log duration: The time consumed for Raft to commit logs
+- Commit log duration per server: The time consumed for Raft to commit logs per TiKV instance
 
 ![TiKV Dashboard - Raft IO metrics](/media/tikv-dashboard-raftio.png)
 
-### Raft process
+## Raft process
+
+- Ready handled: The count of handled ready buckets per region
+- 0.99 Duration of Raft store events: The time consumed by raftstore events (P99)
+- Process ready duration: The time consumed for processes to be ready in Raft
+- Process ready duration per server: The time consumed for peer processes to be ready in Raft. It should be less than `2s` (P99.99).
 
 ![TiKV Dashboard - Raft process metrics](/media/tikv-dashboard-raft-process.png)
 
-### Raft message
+## Raft message
+
+- Sent messages per server: The number of Raft messages sent by each TiKV instance
+- Flush messages per server: The number of Raft messages flushed by each TiKV instance
+- Receive messages per server: The number of Raft messages received by each TiKV instance
+- Messages: The number of Raft messages sent per type
+- Vote: The number of Vote messages sent in Raft
+- Raft dropped messages: The number of dropped Raft messages per type
 
 ![TiKV Dashboard - Raft message metrics](/media/tikv-dashboard-raft-message.png)
 
-### Raft proposal
+## Raft propose
 
-![TiKV Dashboard - Raft proposal metrics](/media/tikv-dashboard-raft-propose.png)
+- Raft apply proposals per ready: The number of Raft proposals of all Regions per ready handled bucket
+- Raft read/write proposals: The number of proposals per type
+- Raft read proposals per server: The number of read proposals made by each TiKV instance
+- Raft write proposals per server: The number of write proposals made by each TiKV instance
+- Propose wait duration: The wait time of each proposal
+- Propose wait duration per server: The wait time of each proposal per TiKV instance
+- Apply wait duration: The apply time of each proposal
+- Apply wait duration per server: The apply time of each proposal per TiKV instance
+- Raft log speed: The rate at which peers propose logs
 
-### Raft admin
+![TiKV Dashboard - Raft propose metrics](/media/tikv-dashboard-raft-propose.png)
+
+## Raft admin
+
+- Admin proposals: The number of admin proposals
+- Admin apply: The number of processed apply commands
+- Check split: The number of raftstore split checks
+- 99.99% Check split duration: The time consumed when running split checks (P99.99)
 
 ![TiKV Dashboard - Raft admin metrics](/media/tikv-dashboard-raft-admin.png)
 
-### Local reader
+## Local reader
+
+- Local reader requests: The number of total requests and the number of rejections from the local read thread
 
 ![TiKV Dashboard - Local reader metrics](/media/tikv-dashboard-local-reader.png)
 
-### Storage
+## Unified Read Pool
 
-![TiKV Dashboard - Storage metrics](/media/tikv-dashboard-storage.png)
+- Time used by level: The time consumed for each level in the unified read pool. Level 0 means small queries.
+- Level 0 chance: The proportion of level 0 tasks in the unified read pool
+- Running tasks: The number of tasks running concurrently in the unified read pool
 
-### Scheduler
+## Storage
 
-![TiKV Dashboard - Scheduler metrics](/media/tikv-dashboard-scheduler.png)
+- Storage command total: The total number of received commands per type
+- Storage async request error: The total number of engine asynchronous request errors
+- Storage async snapshot duration: The time consumed by processing asynchronous snapshot requests. It should be less than `1s` in `.99`.
+- Storage async write duration: The time consumed by processing asynchronous write requests. It should be less than `1s` in `.99`.
 
-### Scheduler - batch_get
+![TiKV Dashboard - Storage metrics](/media/tikv-dashboard-storage.png)
+
+## Scheduler
 
-![TiKV Dashboard - Scheduler - batch_get metrics](/media/tikv-dashboard-scheduler-batch-get.png)
+- Scheduler stage total: The total number of commands at each stage. There should not be lots of errors in a short time.
+- Scheduler writing bytes: The total written bytes per TiKV instance
+- Scheduler priority commands: The count of different priority commands
+- Scheduler pending commands: The count of pending commands per TiKV instance
 
-### Scheduler - cleanup
+![TiKV Dashboard - Scheduler metrics](/media/tikv-dashboard-scheduler.png)
 
-![TiKV Dashboard - Scheduler - cleanup metrics](/media/tikv-dashboard-scheduler-cleanup.png)
+## Scheduler - commit
 
-### Scheduler - commit
+- Scheduler stage total: The total number of commands at each stage when executing the commit command. There should not be lots of errors in a short time.
+- Scheduler command duration: The time consumed when executing the commit command. It should be less than `1s`.
+- Scheduler latch wait duration: The wait time caused by latch when executing the commit command. It should be less than `1s`.
+- Scheduler keys read: The count of keys read by a commit command
+- Scheduler keys written: The count of keys written by a commit command
+- Scheduler scan details: The keys scan details of each CF when executing the commit command
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the commit command
+- Scheduler scan details [write]: The keys scan details of write CF when executing the commit command
+- Scheduler scan details [default]: The keys scan details of default CF when executing the commit command
 
 ![TiKV Dashboard - Scheduler commit metrics](/media/tikv-dashboard-scheduler-commit.png)
+
+## Scheduler - pessimistic_rollback
+
+- Scheduler stage total: The total number of commands at each stage when executing the pessimistic_rollback command. There should not be lots of errors in a short time.
+- Scheduler command duration: The time consumed when executing the pessimistic_rollback command. It should be less than `1s`.
+- Scheduler latch wait duration: The wait time caused by latch when executing the pessimistic_rollback command. It should be less than `1s`.
+- Scheduler keys read: The count of keys read by a pessimistic_rollback command
+- Scheduler keys written: The count of keys written by a pessimistic_rollback command
+- Scheduler scan details: The keys scan details of each CF when executing the pessimistic_rollback command
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the pessimistic_rollback command
+- Scheduler scan details [write]: The keys scan details of write CF when executing the pessimistic_rollback command
+- Scheduler scan details [default]: The keys scan details of default CF when executing the pessimistic_rollback command
+
+## Scheduler - prewrite
+
+- Scheduler stage total: The total number of commands at each stage when executing the prewrite command. There should not be lots of errors in a short time.
+- Scheduler command duration: The time consumed when executing the prewrite command. It should be less than `1s`.
+- Scheduler latch wait duration: The wait time caused by latch when executing the prewrite command. It should be less than `1s`.
+- Scheduler keys read: The count of keys read by a prewrite command
+- Scheduler keys written: The count of keys written by a prewrite command
+- Scheduler scan details: The keys scan details of each CF when executing the prewrite command
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the prewrite command
+- Scheduler scan details [write]: The keys scan details of write CF when executing the prewrite command
+- Scheduler scan details [default]: The keys scan details of default CF when executing the prewrite command
+
+## Scheduler - rollback
+
+- Scheduler stage total: The total number of commands at each stage when executing the rollback command. There should not be lots of errors in a short time.
+- Scheduler command duration: The time consumed when executing the rollback command. It should be less than `1s`.
+- Scheduler latch wait duration: The wait time caused by latch when executing the rollback command. It should be less than `1s`.
+- Scheduler keys read: The count of keys read by a rollback command
+- Scheduler keys written: The count of keys written by a rollback command
+- Scheduler scan details: The keys scan details of each CF when executing the rollback command
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the rollback command
+- Scheduler scan details [write]: The keys scan details of write CF when executing the rollback command
+- Scheduler scan details [default]: The keys scan details of default CF when executing the rollback command
+
+## GC
+
+- MVCC versions: The number of versions for each key
+- MVCC delete versions: The number of versions deleted by GC for each key
+- GC tasks: The count of GC tasks processed by gc_worker
+- GC tasks Duration: The time consumed when executing GC tasks
+- GC keys (write CF): The count of keys in write CF affected during GC
+- TiDB GC worker actions: The count of TiDB GC worker actions
+- TiDB GC seconds: The GC duration
+- GC speed: The number of keys deleted by GC per second
+- TiKV AutoGC Working: The status of Auto GC
+- ResolveLocks Progress: The progress of the first phase of GC (ResolveLocks)
+- TiKV Auto GC Progress: The progress of the second phase of GC
+- TiKV Auto GC SafePoint: The value of the TiKV GC safe point. The safe point is the current GC timestamp.
+- GC lifetime: The lifetime of TiDB GC
+- GC interval: The interval of TiDB GC
+
+## Snapshot
+
+- Rate snapshot message: The rate at which Raft snapshot messages are sent
+- 99% Handle snapshot duration: The time consumed to handle snapshots (P99)
+- Snapshot state count: The number of snapshots per state
+- 99.99% Snapshot size: The snapshot size (P99.99)
+- 99.99% Snapshot KV count: The number of KV within a snapshot (P99.99)
+
+## Task
+
+- Worker handled tasks: The number of tasks handled by worker
+- Worker pending tasks: Current number of pending and running tasks of worker. It should be less than `1000` in normal cases.
+- FuturePool handled tasks: The number of tasks handled by future_pool
+- FuturePool pending tasks: Current number of pending and running tasks of future_pool
+
+## Coprocessor Overview
+
+- Request duration: The time consumed to handle coprocessor read requests
+- Total Requests: The total number of coprocessor requests
+- Handle duration: The histogram of time spent actually processing coprocessor requests per minute
+- Total Request Errors: The total number of coprocessor request errors
+- Total KV Cursor Operations: The total number of KV cursor operations, such as select, index, analyze_table, analyze_index, checksum_table, checksum_index, and so on.
+- KV Cursor Operations: The histogram of KV cursor operations
+- Total RocksDB Perf Statistics: The performance statistics of RocksDB
+- Total Response Size: The total size of coprocessor responses
+
+## Coprocessor Detail
+
+- Handle duration: The histogram of time spent actually processing coprocessor requests per minute
+- 95% Handle duration by store: The time consumed to handle coprocessor requests per TiKV instance (P95)
+- Wait duration: The time consumed when coprocessor requests are waiting to be handled. It should be less than `10s` (P99.99).
+- 95% Wait duration by store: The time consumed when coprocessor requests are waiting to be handled per TiKV instance (P95)
+- Total DAG Requests: The total number of DAG requests
+- Total DAG Executors: The total number of DAG executors
+- Total Ops Details (Table Scan): The total number of RocksDB internal operations when executing select scan
+- Total Ops Details (Index Scan): The total number of RocksDB internal operations when executing index scan
+- Total Ops Details by CF (Table Scan): The select scan details for each CF
+- Total Ops Details by CF (Index Scan): The index scan details for each CF
+
+## Threads
+
+- Threads state: The state of TiKV threads
+- Threads IO: The I/O traffic of each TiKV thread
+- Thread Voluntary Context Switches: The number of voluntary context switches of TiKV threads
+- Thread Nonvoluntary Context Switches: The number of nonvoluntary context switches of TiKV threads
+
+## RocksDB - kv/raft
+
+- Get operations: The count of get operations
+- Get duration: The time consumed when executing get operations
+- Seek operations: The count of seek operations
+- Seek duration: The time consumed when executing seek operations
+- Write operations: The count of write operations
+- Write duration: The time consumed when executing write operations
+- WAL sync operations: The count of WAL sync operations
+- Write WAL duration: The time consumed for writing WAL
+- WAL sync duration: The time consumed when executing WAL sync operations
+- Compaction operations: The count of compaction and flush operations
+- Compaction duration: The time consumed when executing the compaction and flush operations
+- SST read duration: The time consumed when reading SST files
+- Write stall duration: The duration of write stalls. It should be `0` in normal cases.
+- Memtable size: The memtable size of each column family
+- Memtable hit: The hit rate of memtable
+- Block cache size: The block cache size. Broken down by column family if shared block cache is disabled.
+- Block cache hit:The hit rate of block cache
+- Block cache flow:The flow rate of block cache operations per type
+- Block cache operations: The count of block cache operations per type
+- Keys flow:The flow rate of operations on keys per type
+- Total keys:The count of keys in each column family
+- Read flow:The flow rate of read operations per type
+- Bytes / Read:The bytes per read operation
+- Write flow:The flow rate of write operations per type
+- Bytes / Write:The bytes per write operation
+- Compaction flow:The flow rate of compaction operations per type
+- Compaction pending bytes:The pending bytes to be compacted
+- Read amplification:The read amplification per TiKV instance
+- Compression ratio:The compression ratio of each level
+- Number of snapshots:The number of snapshots per TiKV instance
+- Oldest snapshots duration:The time that the oldest unreleased snapshot survivals
+- Number files at each level:The number of SST files for different column families in each level
+- Ingest SST duration seconds:The time consumed to ingest SST files
+- Stall conditions changed of each CF:Stall conditions changed of each column family
+
+## Titan - All
+
+- Blob file count:The number of Titan blob file
+- Blob file size:The total size of Titan blob file
+- Live blob size:The total size of valid blob record
+- Blob cache hit:The hit rate of Titan block cache
+- Iter touched blob file count:The number of blob file involved in a single iterator
+- Blob file discardable ratio distribution:The distribution of blob file failure blob record ratio
+- Blob key size:The size of Titan blob keys
+- Blob value size:The size of Titan blob values
+- Blob get operations:The count of get operations in Titan blob
+- Blob get duration:The time consumed when executing get operations in Titan blob
+- Blob iter operations:The time consumed when executing iter operations in Titan blob
+- Blob seek duration:The time consumed when executing seek operations in Titan blob
+- Blob next duration:The time consumed when executing next operations in Titan blob
+- Blob prev duration:The time consumed when executing prev operations in Titan blob
+- Blob keys flow:The flow rate of operations on Titan blob keys
+- Blob bytes flow:The flow rate of bytes on Titan blob keys
+- Blob file read duration:The time consumed when reading Titan blob file
+- Blob file write duration:The time consumed when writing Titan blob file
+- Blob file sync operations:The count of blob file sync operations
+- Blob file sync duration:The time consumed when sync blob file
+- Blob GC action:The count of Titan GC actions
+- Blob GC duration:The Titan GC duration
+- Blob GC keys flow:The flow rate of keys read and written by Titan GC
+- Blob GC bytes flow:The flow rate of bytes read and written by Titan GC
+- Blob GC input file size:The size of Titan GC input file
+- Blob GC output file size:The size of Titan GC output file
+- Blob GC file count:The count of blob files involved in Titan GC
+
+## Lock manager
+
+- Thread CPU:The CPU utilization of the lock manager thread
+- Handled tasks:The number of taks handled by lock manager
+- Waiter lifetime duration:The time consumed for the transaction waitting for the lock to be released
+- Wait table:The status information of wait table, including the number of locks and the number of transactions waitting for the lock
+- Deadlock detect duration:The time consumed for detecting deadlock
+- Detect error:The number of errors encountered when detecting deadlock, including the number of deadlocks
+- Deadlock detector leader:The information about the node where the deadlock detector leader is located
+
+## Memory
+
+- Allocator Stats:The statistics of the memory allocator
+
+## Backup
+
+- Backup CPU:The CPU utilization of the backup thread
+- Range Size:The histogram of backup range size
+- Backup Duration:The time consumed for backup
+- Backup Flow:The total bytes of backup
+- Disk Throughput:The disk throughput per instance
+- Backup Range Duration:The time consumed for range backup
+- Backup Errors:The number of errors encountered when making a backup
+
+## Encryption
+
+- Encryption data keys:The total number of encrypted data keys
+- Encrypted files:The number of encrypted files
+- Encryption initialized:It shows whether encryption is enabled, `1` means enabled.
+- Encryption meta files size:The size of meta file about encrpytion
+- Encrypt/decrypt data nanos:The histogram of time on encrypting/decrypting data ecch time
+- Read/write encryption meta duration:The time consumed for reading/writing encryption meta file
+
+## 面板常见参数的解释
+
+### gRPC 消息类型
+
+1. 使用事务型接口的命令:
+
+    - kv_get:The command of getting the latest version of data specified by ts
+    - kv_scan:The command of scanning a continuous piece of data
+    - kv_prewrite:The command of prewriting the data to be committed at first phase of 2PC
+    - kv_pessimistic_lock:The command of adding a pessimistic lock to the key to prevent other transaction from modifying
+    - kv_pessimistic_rollback:The command of deleting the pessimistic lock on the key
+    - kv_txn_heart_beat:The command of updating `lock_ttl` for pessimistic transactions or large transactions to prevent them from rolling back
+    - kv_check_txn_status:The command of checking the status of the transaction
+    - kv_commit:The command of committing the data written by prewrite command
+    - kv_cleanup:The command of rolling back a transaction, it will abolished in 4.0
+    - kv_batch_get:The command of getting the value of batch key at once, similar to `kv_get`.
+    - kv_batch_rollback:The command of batch rollback of multiple prewrite transaction
+    - kv_scan_lock:The command of scanning all locks with a version number before `max_version` to clean up expired transactions
+    - kv_resolve_lock:The command of committing or rollback the transaction lock, according to the transaction status.
+    - kv_gc:The command of GC
+    - kv_delete_range:The command of deleting a continuous piece of data from TiKV
+
+2. 非事务型的裸命令:
+
+    - raw_get:The command of getting the value of key
+    - raw_batch_get:The command of getting the value of batch keys
+    - raw_scan:The command of scanning a continuous piece of data
+    - raw_batch_scan:The command of scanning multiple consecutive data
+    - raw_put:The command of writing a key/value pair
+    - raw_batch_put:The command of writing a batch of key/value pairs
+    - raw_delete:The command of deleting a key/value pair
+    - raw_batch_delete:The command of a batch of key/value pairs
+    - raw_delete_range:The command of deleting a continuous interval

From e06f91c0ecf8808e7fba410d8c4031096ffec3fb Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Fri, 26 Jun 2020 23:36:12 +0800
Subject: [PATCH 02/13] fix lint

---
 grafana-tikv-dashboard.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index f98774a9e5aa8..465b6c9bdc1e8 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -24,8 +24,8 @@ This document provides a detailed description of these key metrics.
 - Memory:The memory usage per TiKV instance
 - IO utilization:The I/O utilization per TiKV instance
 - MBps:The total bytes of read and write in each TiKV instance
-- QPS: The QPS per command in each TiKV instance
-- Errps: The total number of gRPC message failures
+- QPS:The QPS per command in each TiKV instance
+- Errps:The total number of gRPC message failures
 - leader:The number of leaders per TiKV instance
 - Region:The number of Regions per TiKV instance
 - Uptime:The runtime of TiKV since last restart

From 763e69d44b17259e473db23ec12ee39d72e9bd7a Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Sun, 28 Jun 2020 19:18:58 +0800
Subject: [PATCH 03/13] Fix first round of review

---
 grafana-tikv-dashboard.md | 72 +++++++++++++++++++--------------------
 1 file changed, 36 insertions(+), 36 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 465b6c9bdc1e8..0c1908d6cdeee 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -5,15 +5,15 @@ category: reference
 aliases: ['/docs/dev/grafana-tikv-dashboard/','/docs/dev/reference/key-monitoring-metrics/tikv-dashboard/']
 ---

-# The metrics description of TiKV
+# Description of TiKV Monitoring Metrics

 If you use TiUP to deploy the TiDB cluster, the monitoring system (Prometheus/Grafana) is deployed at the same time. For more information, see [Overview of the Monitoring Framework](/tidb-monitoring-framework.md).

-The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node\_exporter, and so on. A lot of metrics are there to help you diagnose.
+The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node_exporter, and so on. A lot of metrics are there to help you diagnose.

-You can get an overview of the component TiKV status from the TiKV dashboard, where the key metrics are displayed. According to the [Performance Map](https://asktug.com/_/tidb-performance-map/#/), you can check whether the status of the cluster is as expected.
+You can get an overview of the component TiKV status from the **TiKV-Details** dashboard, where the key metrics are displayed. According to the [Performance Map](https://asktug.com/_/tidb-performance-map/#/), you can check whether the status of the cluster is as expected.

-This document provides a detailed description of these key metrics.
+This document provides a detailed description of these key metrics on the **TiKV-Details** dashboard.

 ## Cluster

@@ -25,7 +25,7 @@ This document provides a detailed description of these key metrics.
 - IO utilization:The I/O utilization per TiKV instance
 - MBps:The total bytes of read and write in each TiKV instance
 - QPS:The QPS per command in each TiKV instance
-- Errps:The total number of gRPC message failures
+- Errps:The rate of gRPC message failures
 - leader:The number of leaders per TiKV instance
 - Region:The number of Regions per TiKV instance
 - Uptime:The runtime of TiKV since last restart
@@ -53,27 +53,27 @@ This document provides a detailed description of these key metrics.
 - Channel full:The number of Channel Full errors per TiKV instance. It should be `0` in normal case.
 - Active written leaders:The number of leaders being written on each TiKV instance
 - Approximate Region size:The approximate Region size
-- Approximate Region size Histogram:The histogram of approximate Region size
-- Region average written keys:The average rate of written keys to Regions per TiKV instance
-- Region average written bytes:The average rate of writing bytes to Regions per TiKV instance
+- Approximate Region size Histogram:The histogram of each approximate Region size
+- Region average written keys:The average number of written keys to Regions per TiKV instance
+- Region average written bytes: The average written bytes to Regions per TiKV instance

 ![TiKV Dashboard - Server metrics](/media/tikv-dashboard-server.png)

 ## gRPC

-- gRPC message count:The number of gRPC messages
+- gRPC message count: The number of gRPC messages per type
 - gRPC message failed:The number of failed gRPC messages
-- 99% gRPC message duration:99% duration of gRPC messages
-- Average gRPC message duration:Average duration of gRPC messages
+- 99% gRPC message duration: The gRPC message duration per message type (P99)
+- Average gRPC message duration: The average execution time of gRPC messages
 - gRPC batch size:The batch size of gRPC messages between TiDB and TiKV
-- raft message batch size:The batch size of raft messages
+- Raft message batch size:The batch size of Raft messages between TiKV instances

 ## Thread CPU

 - Raft store CPU:The CPU utilization of the raftstore thread. The CPU usage should be less than 80% * `raftstore.store-pool-size` in normal case.
-- Async apply CPU:The CPU utilization of async apply. The CPU usage should be less than 80% * `raftstore.apply-pool-size` in normal case.
-- Scheduler worker CPU:The CPU utilization of scheduler. The CPU usage should be less than 90% * `storage.scheduler-worker-pool-size` in normal case.
-- gRPC poll CPU:The CPU utilization of gRPC. The CPU usage should be less than 80% * `server.grpc-concurrency` in normal case.
+- Async apply CPU:The CPU utilization of the `async apply` thread. The CPU usage should be less than 90% * `raftstore.apply-pool-size` in normal cases.
+- Scheduler worker CPU:The CPU utilization of the `scheduler worker` thread. The CPU usage should be less than 90% * `storage.scheduler-worker-pool-size` in normal cases.
+- gRPC poll CPU:The CPU utilization of the `gRPC` thread. The CPU usage should be less than 80% * `server.grpc-concurrency` in normal cases.
 - Unified read pool CPU:The CPU utilization of unified read pool
 - Storage ReadPool CPU:The CPU utilization of readpool
 - Coprocessor CPU:The CPU utilization of coprocessor
@@ -85,13 +85,13 @@ This document provides a detailed description of these key metrics.
 ## PD

 - PD requests:The count of requests that TiKV sends to PD
-- PD request duration (average):The time consumed by requests that TiKV sends to PD
+- PD request duration (average):The average time consumed by requests that TiKV sends to PD
 - PD heartbeats:The total number of PD heartbeat messages
 - PD validate peers:The total number of peers validated by the PD worker

 ## Raft IO

-- Apply log duration:Raft apply The time consumed for Raft to apply logs
+- Apply log duration:The time consumed for Raft to apply logs
 - Apply log duration per server:The time consumed for Raft to apply logs per TiKV instance
 - Append log duration:The time consumed for Raft to append logs
 - Append log duration per server:The time consumed for Raft to append logs per TiKV instance
@@ -102,35 +102,35 @@ This document provides a detailed description of these key metrics.

 ## Raft process

-- Ready handled:The count of handled ready buckets per region
+- Ready handled:The count of handled ready operations per second
 - 0.99 Duration of Raft store events:The time consumed by raftstore events (P99)
 - Process ready duration:The time consumed for processes to be ready in Raft
-- Process ready duration per server:The time consumed for peer processes to be ready in Raft. It should be less than 2s(P99.99).
+- Process ready duration per server:The time consumed for peer processes to be ready in Raft. It should be less than 2 seconds (P99.99).

 ![TiKV Dashboard - Raft process metrics](/media/tikv-dashboard-raft-process.png)

 ## Raft message

-- Sent messages per server:The number of Raft messages sent by each TiKV instance
-- Flush messages per server:The number of Raft messages flushed by each TiKV instance
-- Receive messages per server:The number of Raft messages received by each TiKV instance
-- Messages:The number of Raft messages sent per type
-- Vote:The number of Vote messages sent in Raft
+- Sent messages per server:The number of Raft messages sent per second by each TiKV instance
+- Flush messages per server:The number of Raft messages flushed per second by the Raft client in each TiKV instance
+- Receive messages per server:The number of Raft messages received per second by each TiKV instance
+- Messages:The number of Raft messages sent per type per second
+- Vote:The number of Vote messages sent in Raft per second
 - Raft dropped messages:The number of dropped Raft messages per type

 ![TiKV Dashboard - Raft message metrics](/media/tikv-dashboard-raft-message.png)

 ## Raft propose

-- Raft apply proposals per ready:The number of Raft proposals of all Regions per ready handled bucket
+- Raft apply proposals per ready:The histogram of the number of proposals that each ready operation containes in a batch while applying proposal.
 - Raft read/write proposals:The number of proposals per type
 - Raft read proposals per server:The number of read proposals made by each TiKV instance
 - Raft write proposals per server:The number of write proposals made by each TiKV instance
-- Propose wait duration:The wait time of each proposal
-- Propose wait duration per server:The wait time of each proposal per TiKV instance
-- Apply wait duration:The apply time of each proposal
-- Apply wait duration per server:The apply time of each proposal per TiKV instance
-- Raft log speed:The rate at which peers propose logs
+- Propose wait duration:The histogram of wait time of each proposal
+- Propose wait duration per server:The histogram of wait time of each proposal per TiKV instance
+- Apply wait duration:The histogram of apply time of each proposal
+- Apply wait duration per server:The histogram of apply time of each proposal per TiKV instance
+- Raft log speed:The average rate at which peers propose logs

 ![TiKV Dashboard - Raft propose metrics](/media/tikv-dashboard-raft-propose.png)

@@ -138,8 +138,8 @@ This document provides a detailed description of these key metrics.

 - Admin proposals:The number of admin proposals
 - Admin apply:The number of processed apply commands
-- Check split:The number of raftstore split checks
-- 99.99% Check split duration:The time consumed when running split checks (P99.99)
+- Check split:The number of raftstore split check commands
+- 99.99% Check split duration:The time consumed when running split check commands (P99.99)

 ![TiKV Dashboard - Raft admin metrics](/media/tikv-dashboard-raft-admin.png)

@@ -386,11 +386,11 @@ This document provides a detailed description of these key metrics.
 - Encrypt/decrypt data nanos:The histogram of time on encrypting/decrypting data ecch time
 - Read/write encryption meta duration:The time consumed for reading/writing encryption meta file

-## 面板常见参数的解释
+## Explanation of Common Parameters

-### gRPC 消息类型
+### gRPC Message Type

-1. 使用事务型接口的命令:
+1. Transactional API:

     - kv_get:The command of getting the latest version of data specified by ts
     - kv_scan:The command of scanning a continuous piece of data
@@ -408,7 +408,7 @@ This document provides a detailed description of these key metrics.
     - kv_gc:The command of GC
     - kv_delete_range:The command of deleting a continuous piece of data from TiKV

-2. 非事务型的裸命令:
+2. Raw API:

     - raw_get:The command of getting the value of key
     - raw_batch_get:The command of getting the value of batch keys

From ba0a764c28f763e6bea0c460eeaaf8561933c92b Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Sun, 28 Jun 2020 19:23:33 +0800
Subject: [PATCH 04/13] fix

---
 grafana-tikv-dashboard.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 0c1908d6cdeee..295fef15267a9 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -116,7 +116,7 @@ This document provides a detailed description of these key metrics on the **TiKV
 - Receive messages per server:The number of Raft messages received per second by each TiKV instance
 - Messages:The number of Raft messages sent per type per second
 - Vote:The number of Vote messages sent in Raft per second
-- Raft dropped messages:The number of dropped Raft messages per type
+- Raft dropped messages:The number of dropped Raft messages per type per second

 ![TiKV Dashboard - Raft message metrics](/media/tikv-dashboard-raft-message.png)


From ffe3c1be5366fb70f4ace057853857dfe576befc Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Sun, 28 Jun 2020 23:04:28 +0800
Subject: [PATCH 05/13] Fix some errors like ops

---
 grafana-tikv-dashboard.md | 102 +++++++++++++++++++-------------------
 1 file changed, 51 insertions(+), 51 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 295fef15267a9..3931030223e2f 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -61,8 +61,8 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## gRPC

-- gRPC message count: The number of gRPC messages per type
-- gRPC message failed:The number of failed gRPC messages
+- gRPC message count: The rate of gRPC messages per type
+- gRPC message failed:The rate of failed gRPC messages
 - 99% gRPC message duration: The gRPC message duration per message type (P99)
 - Average gRPC message duration: The average execution time of gRPC messages
 - gRPC batch size:The batch size of gRPC messages between TiDB and TiKV
@@ -70,24 +70,24 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Thread CPU

-- Raft store CPU:The CPU utilization of the raftstore thread. The CPU usage should be less than 80% * `raftstore.store-pool-size` in normal case.
+- Raft store CPU:The CPU utilization of the `raftstore` thread. The CPU usage should be less than 80% * `raftstore.store-pool-size` in normal case.
 - Async apply CPU:The CPU utilization of the `async apply` thread. The CPU usage should be less than 90% * `raftstore.apply-pool-size` in normal cases.
 - Scheduler worker CPU:The CPU utilization of the `scheduler worker` thread. The CPU usage should be less than 90% * `storage.scheduler-worker-pool-size` in normal cases.
 - gRPC poll CPU:The CPU utilization of the `gRPC` thread. The CPU usage should be less than 80% * `server.grpc-concurrency` in normal cases.
-- Unified read pool CPU:The CPU utilization of unified read pool
-- Storage ReadPool CPU:The CPU utilization of readpool
-- Coprocessor CPU:The CPU utilization of coprocessor
-- RocksDB CPU:The CPU utilization of RocksDB
-- Split check CPU:The CPU utilization of split check
-- GC worker CPU:The CPU utilization of GC worker
-- Snapshot worker CPU:The CPU utilization of snapshot worker
+- Unified read pool CPU:The CPU utilization of `unified read pool` thread
+- Storage ReadPool CPU:The CPU utilization of `storage read pool` thread
+- Coprocessor CPU:The CPU utilization of `coprocessor` thread
+- RocksDB CPU:The CPU utilization of RocksDB thread
+- Split check CPU:The CPU utilization of `split check` thread
+- GC worker CPU:The CPU utilization of `GC worker` thread
+- Snapshot worker CPU:The CPU utilization of `snapshot worker` thread

 ## PD

-- PD requests:The count of requests that TiKV sends to PD
+- PD requests:The rate of requests that TiKV sends to PD
 - PD request duration (average):The average time consumed by requests that TiKV sends to PD
-- PD heartbeats:The total number of PD heartbeat messages
-- PD validate peers:The total number of peers validated by the PD worker
+- PD heartbeats:The rate of heartbeat messages sended from TiKV to PD
+- PD validate peers:The rate of messages that sended from TiKV to PD to validate peer

 ## Raft IO

@@ -105,15 +105,15 @@ This document provides a detailed description of these key metrics on the **TiKV
 - Ready handled:The count of handled ready operations per second
 - 0.99 Duration of Raft store events:The time consumed by raftstore events (P99)
 - Process ready duration:The time consumed for processes to be ready in Raft
-- Process ready duration per server:The time consumed for peer processes to be ready in Raft. It should be less than 2 seconds (P99.99).
+- Process ready duration per server:The time consumed for peer processes to be ready in Raft per TiKV instance. It should be less than 2 seconds (P99.99).

 ![TiKV Dashboard - Raft process metrics](/media/tikv-dashboard-raft-process.png)

 ## Raft message

-- Sent messages per server:The number of Raft messages sent per second by each TiKV instance
-- Flush messages per server:The number of Raft messages flushed per second by the Raft client in each TiKV instance
-- Receive messages per server:The number of Raft messages received per second by each TiKV instance
+- Sent messages per server:The number of Raft messages sent by each TiKV instance per second
+- Flush messages per server:The number of Raft messages flushed by the Raft client in each TiKV instance per second
+- Receive messages per server:The number of Raft messages received by each TiKV instance per second
 - Messages:The number of Raft messages sent per type per second
 - Vote:The number of Vote messages sent in Raft per second
 - Raft dropped messages:The number of dropped Raft messages per type per second
@@ -123,9 +123,9 @@ This document provides a detailed description of these key metrics on the **TiKV
 ## Raft propose

 - Raft apply proposals per ready:The histogram of the number of proposals that each ready operation containes in a batch while applying proposal.
-- Raft read/write proposals:The number of proposals per type
-- Raft read proposals per server:The number of read proposals made by each TiKV instance
-- Raft write proposals per server:The number of write proposals made by each TiKV instance
+- Raft read/write proposals:The number of proposals per type per second
+- Raft read proposals per server:The number of read proposals made by each TiKV instance per second
+- Raft write proposals per server:The number of write proposals made by each TiKV instance per second
 - Propose wait duration:The histogram of wait time of each proposal
 - Propose wait duration per server:The histogram of wait time of each proposal per TiKV instance
 - Apply wait duration:The histogram of apply time of each proposal
@@ -136,9 +136,9 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Raft admin

-- Admin proposals:The number of admin proposals
-- Admin apply:The number of processed apply commands
-- Check split:The number of raftstore split check commands
+- Admin proposals:The number of admin proposals per second
+- Admin apply:The number of processed apply commands per second
+- Check split:The number of raftstore split check commands per second
 - 99.99% Check split duration:The time consumed when running split check commands (P99.99)

 ![TiKV Dashboard - Raft admin metrics](/media/tikv-dashboard-raft-admin.png)
@@ -157,8 +157,8 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Storage

-- Storage command total:The total number of received commands per type
-- Storage async request error:The total number of engine asynchronous request errors
+- Storage command total:The number of received command by type per second
+- Storage async request error:The number of engine asynchronous request errors per second
 - Storage async snapshot duration:The time consumed by processing asynchronous snapshot requests. It should be less than `1s` in `.99`.
 - Storage async write duration:The time consumed by processing asynchronous write requests. It should be less than `1s` in `.99`.

@@ -166,7 +166,7 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Scheduler

-- Scheduler stage total:The total number of commands at each stage. There should not be lots of errors in a short time.
+- Scheduler stage total:The number of commands at each stage per second. There should not be lots of errors in a short time.
 - Scheduler writing bytes:The total bytes of writing bytes per TiKV instance
 - Scheduler priority commands:The count of different priority commands
 - Scheduler pending commands:The count of pending commands per TiKV instance
@@ -175,7 +175,7 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Scheduler - commit

-- Scheduler stage total:The total number of commands at each stage when executing the commit command. There should not be lots of errors in a short time.
+- Scheduler stage total:The number of commands at each stage per second when executing the commit command. There should not be lots of errors in a short time.
 - Scheduler command duration:The time consumed when executing the commit command. It should be less than `1s`.
 - Scheduler latch wait duration:The wait time caused by latch when executing the commit command. It should be less than `1s`.
 - Scheduler keys read:The count of keys read by a commit command
@@ -189,7 +189,7 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Scheduler - pessimistic_rollback

-- Scheduler stage total:The total number of commands at each stage when executing the pessimistic_rollback command. There should not be lots of errors in a short time.
+- Scheduler stage total:The number of commands at each stage per second when executing the pessimistic_rollback command. There should not be lots of errors in a short time.
 - Scheduler command duration:The time consumed when executing the pessimistic_rollback command. It should be less than `1s`.
 - Scheduler latch wait duration:The wait time caused by latch when executing the pessimistic_rollback command. It should be less than `1s`.
 - Scheduler keys read:The count of keys read by a pessimistic_rollback command
@@ -201,7 +201,7 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Scheduler - prewrite

-- Scheduler stage total:The total number of commands at each stage when executing the prewrite command. There should not be lots of errors in a short time.
+- Scheduler stage total:The number of commands at each stage per second when executing the prewrite command. There should not be lots of errors in a short time.
 - Scheduler command duration:The time consumed when executing the prewrite command. It should be less than `1s`.
 - Scheduler latch wait duration:The wait time caused by latch when executing the prewrite command. It should be less than `1s`.
 - Scheduler keys read:The count of keys read by a prewrite command
@@ -213,7 +213,7 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Scheduler - rollback

-- Scheduler stage total:The total number of commands at each stage when executing the rollback command. There should not be lots of errors in a short time.
+- Scheduler stage total:The number of commands at each stage per second when executing the rollback command. There should not be lots of errors in a short time.
 - Scheduler command duration:The time consumed when executing the rollback command. It should be less than `1s`.
 - Scheduler latch wait duration:The wait time caused by latch when executing the rollback command. It should be less than `1s`.
 - Scheduler keys read:The count of keys read by a rollback command
@@ -236,7 +236,7 @@ This document provides a detailed description of these key metrics on the **TiKV
 - TiKV AutoGC Working:The status of Auto GC
 - ResolveLocks Progress:The progress of the first phase of GC(ResolveLocks)
 - TiKV Auto GC Progress:The progress of the second phase of GC
-- TiKV Auto GC SafePoint:TiKV GC safr point value, safe point is the current GC timestamp
+- TiKV Auto GC SafePoint:TiKV GC safe point value, safe point is the current GC timestamp
 - GC lifetime:The lifetime of TiDB GC
 - GC interval:The interval of TiDB GC

@@ -250,19 +250,19 @@ This document provides a detailed description of these key metrics on the **TiKV

 ## Task

-- Worker handled tasks:The number of tasks handled by worker
-- Worker pending tasks:Current number of pending and running tasks of worker. It should be less than `1000` in normal case.
-- FuturePool handled tasks:The number of tasks handled by future_pool
-- FuturePool pending tasks:Current number of pending and running tasks of future_pool
+- Worker handled tasks:The number of tasks handled by worker persecond
+- Worker pending tasks:Current number of pending and running tasks of worker per second. It should be less than `1000` in normal case.
+- FuturePool handled tasks:The number of tasks handled by future_pool per second
+- FuturePool pending tasks:Current number of pending and running tasks of future_pool per second

 ## Coprocessor Overview

-- Request duration:The time consumed to handle coprocessor read requests
-- Total Requests:The number of total coprocessor request
+- Request duration:The total time spent from receiving the coprocessor request to the end of processing
+- Total Requests:The number of requests by type per second
 - Handle duration:The histogram of time spent actually processing coprocessor requests per minute
-- Total Request Errors:The total number of the coprocessor request errors
-- Total KV Cursor Operations:The total number of the KV cursor operations, such as select, index, analyze_table, analyze_index, checksum_table, checksum_index, and so on.
-- KV Cursor Operations:The histogram of KV cursor operations
+- Total Request Errors:The number of request error of Coprocessor. There should not be lots of errors in a short time.
+- Total KV Cursor Operations:The total number of the KV cursor operations by type per second, such as select, index, analyze_table, analyze_index, checksum_table, checksum_index, and so on.
+- KV Cursor Operations:The histogram of KV cursor operations by type per second
 - Total RocksDB Perf Statistics:The performance statistics of RocksDB
 - Total Response Size:The total size of coprocessor response

@@ -272,12 +272,12 @@ This document provides a detailed description of these key metrics on the **TiKV
 - 95% Handle duration by store:The time consumed to handle coprocessor requests per TiKV instance (P95)
 - Wait duration:The time consumed when coprocessor requests are waiting to be handled. It should be less than `10s`(P99.99).
 - 95% Wait duration by store:The time consumed when coprocessor requests are waiting to be handled per TiKV instance (P95)
-- Total DAG Requests:The total number of DAG requests
-- Total DAG Executors:The total number of DAG executors
-- Total Ops Details (Table Scan):The total number of RocksDB internal operations when executing select scan
-- Total Ops Details (Index Scan):The total number of RocksDB internal operations when executing index scan
-- Total Ops Details by CF (Table Scan):The select scan details for each CF
-- Total Ops Details by CF (Index Scan):The index scan details for each CF
+- Total DAG Requests:The number of DAG requests per second
+- Total DAG Executors:The number of DAG executors per second
+- Total Ops Details (Table Scan):The number of RocksDB internal operations per second when executing select scan in coprocessor
+- Total Ops Details (Index Scan):The number of RocksDB internal operations per second when executing index scan in coprocessor
+- Total Ops Details by CF (Table Scan):The number of RocksDB internal operations for each CF per second when executing select scan in coprocessor
+- Total Ops Details by CF (Index Scan):The number of RocksDB internal operations for each CF per second when executing index scan in coprocessor
 
 ## Threads
 
@@ -288,16 +288,16 @@ This document provides a detailed description of these key metrics on the **TiKV
 
 ## RocksDB - kv/raft
 
-- Get operations:The count of get operations
+- Get operations:The count of get operations per second
 - Get duration:The time consumed when executing get operations
-- Seek operations:The count of seek operations
+- Seek operations:The count of seek operations per second
 - Seek duration:The time consumed when executing seek operations
-- Write operations:The count of write operations
+- Write operations:The count of write operations per second
 - Write duration:The time consumed when executing write operations
-- WAL sync operations:The count of WAL sync operations
+- WAL sync operations:The count of WAL sync operations per second
 - Write WAL duration:The time consumed for writing WAL
 - WAL sync duration:The time consumed when executing WAL sync operations
-- Compaction operationsThe count of compaction and flush operations
+- Compaction operationsThe count of compaction and flush operations per second
 - Compaction duration:The time consumed when executing the compaction and flush operations
 - SST read duration:The time consumed when reading SST files
 - Write stall duration:Write stall duration. It should be `0` in normal case.

From a85ad8c3fc6676120a750c5da7089e843bc6b50d Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Fri, 3 Jul 2020 23:41:35 +0800
Subject: [PATCH 06/13] fix ditto

---
 grafana-tikv-dashboard.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 3931030223e2f..944706ae49fca 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -393,7 +393,7 @@ This document provides a detailed description of these key metrics on the **TiKV
 1. Transactional API:
 
     - kv_get:The command of getting the latest version of data specified by ts
-    - kv_scan:The command of scanning a continuous piece of data
+    - kv_scan:The command of scanning a range of data
     - kv_prewrite:The command of prewriting the data to be committed at first phase of 2PC
     - kv_pessimistic_lock:The command of adding a pessimistic lock to the key to prevent other transaction from modifying
     - kv_pessimistic_rollback:The command of deleting the pessimistic lock on the key
@@ -406,16 +406,16 @@ This document provides a detailed description of these key metrics on the **TiKV
     - kv_scan_lock:The command of scanning all locks with a version number before `max_version` to clean up expired transactions
     - kv_resolve_lock:The command of committing or rollback the transaction lock, according to the transaction status.
     - kv_gc:The command of GC
-    - kv_delete_range:The command of deleting a continuous piece of data from TiKV
+    - kv_delete_range:The command of deleting a range of data from TiKV
 
 2. Raw API:
 
     - raw_get:The command of getting the value of key
     - raw_batch_get:The command of getting the value of batch keys
-    - raw_scan:The command of scanning a continuous piece of data
+    - raw_scan:The command of scanning a range of data
     - raw_batch_scan:The command of scanning multiple consecutive data
     - raw_put:The command of writing a key/value pair
     - raw_batch_put:The command of writing a batch of key/value pairs
     - raw_delete:The command of deleting a key/value pair
     - raw_batch_delete:The command of a batch of key/value pairs
-    - raw_delete_range:The command of deleting a continuous interval
+    - raw_delete_range:The command of deleting a range of data

From bf2881a501242eb6c72474c3323640d0b2be5bb9 Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Wed, 8 Jul 2020 12:53:57 +0800
Subject: [PATCH 07/13] Update grafana-tikv-dashboard.md

Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com>
---
 grafana-tikv-dashboard.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 944706ae49fca..c8779b3b4c3f4 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -151,7 +151,7 @@ This document provides a detailed description of these key metrics on the **TiKV
 
 ## Unified Read Pool
 
-- Time used by level:The time consumed for each level in unified read pool, level 0 means small query
+- Time used by level:The time consumed for each level in the unified read pool. Level 0 means small queries.
 - Level 0 chance:The proportion of level 0 tasks in unified read pool
 - Running tasks:The number of tasks running concurrently in the unified read pool
 

From 2d21f2031386627752508dd05db48848c4067562 Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Wed, 8 Jul 2020 12:54:14 +0800
Subject: [PATCH 08/13] Update grafana-tikv-dashboard.md

Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com>
---
 grafana-tikv-dashboard.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index c8779b3b4c3f4..42532c1e1c5d3 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -167,7 +167,7 @@ This document provides a detailed description of these key metrics on the **TiKV
 ## Scheduler
 
 - Scheduler stage total:The number of commands at each stage per second. There should not be lots of errors in a short time.
-- Scheduler writing bytes:The total bytes of writing bytes per TiKV instance
+- Scheduler writing bytes:The total written bytes of commands processed by each TiKV instance
 - Scheduler priority commands:The count of different priority commands
 - Scheduler pending commands:The count of pending commands per TiKV instance
 

From 0e9613dfa2ec8a354d7716f9b100eee0d52cf35a Mon Sep 17 00:00:00 2001
From: Win-Man <825895587@qq.com>
Date: Wed, 8 Jul 2020 14:13:21 +0800
Subject: [PATCH 09/13] Update grafana-tikv-dashboard

---
 grafana-tikv-dashboard.md | 58 +++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 42532c1e1c5d3..cc15ae462435b 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -168,8 +168,8 @@ This document provides a detailed description of these key metrics on the **TiKV
 
 - Scheduler stage total:The number of commands at each stage per second. There should not be lots of errors in a short time.
 - Scheduler writing bytes:The total written bytes of commands processed by each TiKV instance
-- Scheduler priority commands:The count of different priority commands
-- Scheduler pending commands:The count of pending commands per TiKV instance
+- Scheduler priority commands:The count of different priority commands per second
+- Scheduler pending commands:The count of pending commands per TiKV instance per second
 
 ![TiKV Dashboard - Scheduler metrics](/media/tikv-dashboard-scheduler.png)
 
@@ -234,9 +234,9 @@ This document provides a detailed description of these key metrics on the **TiKV
 - TiDB GC seconds:The GC duration
 - GC speed:The number of keys deleted by GC per second
 - TiKV AutoGC Working:The status of Auto GC
-- ResolveLocks Progress:The progress of the first phase of GC(ResolveLocks)
+- ResolveLocks Progress:The progress of the first phase of GC(Resolve Locks)
 - TiKV Auto GC Progress:The progress of the second phase of GC
-- TiKV Auto GC SafePoint:TiKV GC safe point value, safe point is the current GC timestamp
+- TiKV Auto GC SafePoint:The value of TiKV GC safe point. The safe point is the current GC timestamp
 - GC lifetime:The lifetime of TiDB GC
 - GC interval:The interval of TiDB GC
 
@@ -250,17 +250,17 @@ This document provides a detailed description of these key metrics on the **TiKV
 
 ## Task
 
-- Worker handled tasks:The number of tasks handled by worker persecond
+- Worker handled tasks:The number of tasks handled by worker per second
 - Worker pending tasks:Current number of pending and running tasks of worker per second. It should be less than `1000` in normal case.
-- FuturePool handled tasks:The number of tasks handled by future_pool per second
-- FuturePool pending tasks:Current number of pending and running tasks of future_pool per second
+- FuturePool handled tasks:The number of tasks handled by future pool per second
+- FuturePool pending tasks:Current number of pending and running tasks of future pool per second
 
 ## Coprocessor Overview
 
-- Request duration:The total time spent from receiving the coprocessor request to the end of processing
+- Request duration:The total time spent from receiving the coprocessor request to the end of request processing
 - Total Requests:The number of requests by type per second
 - Handle duration:The histogram of time spent actually processing coprocessor requests per minute
-- Total Request Errors:The number of request error of Coprocessor. There should not be lots of errors in a short time.
+- Total Request Errors:The number of request errors of Coprocessor. There should not be lots of errors in a short time.
 - Total KV Cursor Operations:The total number of the KV cursor operations by type per second, such as select, index, analyze_table, analyze_index, checksum_table, checksum_index, and so on.
 - KV Cursor Operations:The histogram of KV cursor operations by type per second
 - Total RocksDB Perf Statistics:The performance statistics of RocksDB
@@ -269,11 +269,11 @@ This document provides a detailed description of these key metrics on the **TiKV
 ## Coprocessor Detail
 
 - Handle duration:The histogram of time spent actually processing coprocessor requests per minute
-- 95% Handle duration by store:The time consumed to handle coprocessor requests per TiKV instance (P95)
-- Wait duration:The time consumed when coprocessor requests are waiting to be handled. It should be less than `10s`(P99.99).
-- 95% Wait duration by store:The time consumed when coprocessor requests are waiting to be handled per TiKV instance (P95)
-- Total DAG Requests:The number of DAG requests per second
-- Total DAG Executors:The number of DAG executors per second
+- 95% Handle duration by store:The time consumed to handle coprocessor requests per TiKV instance per second (P95)
+- Wait duration:The time consumed when coprocessor requests are waiting to be handled. It should be less than `10s` (P99.99).
+- 95% Wait duration by store:The time consumed when coprocessor requests are waiting to be handled per TiKV instance per second (P95)
+- Total DAG Requests:The total number of DAG requests per second
+- Total DAG Executors:The total number of DAG executors per second
 - Total Ops Details (Table Scan):The number of RocksDB internal operations per second when executing select scan in coprocessor
 - Total Ops Details (Index Scan):The number of RocksDB internal operations per second when executing index scan in coprocessor
 - Total Ops Details by CF (Table Scan):The number of RocksDB internal operations for each CF per second when executing select scan in coprocessor
@@ -297,7 +297,7 @@ This document provides a detailed description of these key metrics on the **TiKV
 - WAL sync operations:The count of WAL sync operations per second
 - Write WAL duration:The time consumed for writing WAL
 - WAL sync duration:The time consumed when executing WAL sync operations
-- Compaction operationsThe count of compaction and flush operations per second
+- Compaction operations: The count of compaction and flush operations per second
 - Compaction duration:The time consumed when executing the compaction and flush operations
 - SST read duration:The time consumed when reading SST files
 - Write stall duration:Write stall duration. It should be `0` in normal case.
@@ -325,12 +325,12 @@ This document provides a detailed description of these key metrics on the **TiKV
 
 ## Titan - All
 
-- Blob file count:The number of Titan blob file
+- Blob file count:The number of Titan blob files
 - Blob file size:The total size of Titan blob file
 - Live blob size:The total size of valid blob record
 - Blob cache hit:The hit rate of Titan block cache
 - Iter touched blob file count:The number of blob file involved in a single iterator
-- Blob file discardable ratio distribution:The distribution of blob file failure blob record ratio
+- Blob file discardable ratio distribution:The ratio distribution of blob record failure of blob files
 - Blob key size:The size of Titan blob keys
 - Blob value size:The size of Titan blob values
 - Blob get operations:The count of get operations in Titan blob
@@ -344,7 +344,7 @@ This document provides a detailed description of these key metrics on the **TiKV
 - Blob file read duration:The time consumed when reading Titan blob file
 - Blob file write duration:The time consumed when writing Titan blob file
 - Blob file sync operations:The count of blob file sync operations
-- Blob file sync duration:The time consumed when sync blob file
+- Blob file sync duration:The time consumed when synchronizing blob file
 - Blob GC action:The count of Titan GC actions
 - Blob GC duration:The Titan GC duration
 - Blob GC keys flow:The flow rate of keys read and written by Titan GC
@@ -357,11 +357,11 @@ This document provides a detailed description of these key metrics on the **TiKV
 
 - Thread CPU:The CPU utilization of the lock manager thread
 - Handled tasks:The number of taks handled by lock manager
-- Waiter lifetime duration:The time consumed for the transaction waitting for the lock to be released
-- Wait table:The status information of wait table, including the number of locks and the number of transactions waitting for the lock
+- Waiter lifetime duration:The waiting time of the transaction for the lock to be released
+- Wait table:The status information of wait table, including the number of locks and the number of transactions waiting for the lock
 - Deadlock detect duration:The time consumed for detecting deadlock
 - Detect error:The number of errors encountered when detecting deadlock, including the number of deadlocks
-- Deadlock detector leader:The information about the node where the deadlock detector leader is located
+- Deadlock detector leader:The information of the node where the deadlock detector leader is located
 
 ## Memory
 
@@ -374,16 +374,16 @@ This document provides a detailed description of these key metrics on the **TiKV
 - Backup Duration:The time consumed for backup
 - Backup Flow:The total bytes of backup
 - Disk Throughput:The disk throughput per instance
-- Backup Range Duration:The time consumed for range backup
-- Backup Errors:The number of errors encountered when making a backup
+- Backup Range Duration:The time consumed for backing up a range
+- Backup Errors:The number of errors encountered during a backup
 
 ## Encryption
 
 - Encryption data keys:The total number of encrypted data keys
 - Encrypted files:The number of encrypted files
-- Encryption initialized:It shows whether encryption is enabled, `1` means enabled.
-- Encryption meta files size:The size of meta file about encrpytion
-- Encrypt/decrypt data nanos:The histogram of time on encrypting/decrypting data ecch time
+- Encryption initialized:Shows whether encryption is enabled, `1` means enabled.
+- Encryption meta files size:The size of the encryption meta file
+- Encrypt/decrypt data nanos:The histogram of duration on encrypting/decrypting data each time
 - Read/write encryption meta duration:The time consumed for reading/writing encryption meta file
 
 ## Explanation of Common Parameters
@@ -395,12 +395,12 @@ This document provides a detailed description of these key metrics on the **TiKV
     - kv_get:The command of getting the latest version of data specified by ts
     - kv_scan:The command of scanning a range of data
     - kv_prewrite:The command of prewriting the data to be committed at first phase of 2PC
-    - kv_pessimistic_lock:The command of adding a pessimistic lock to the key to prevent other transaction from modifying
+    - kv_pessimistic_lock:The command of adding a pessimistic lock to the key to prevent other transaction from modifying this key
     - kv_pessimistic_rollback:The command of deleting the pessimistic lock on the key
     - kv_txn_heart_beat:The command of updating `lock_ttl` for pessimistic transactions or large transactions to prevent them from rolling back
     - kv_check_txn_status:The command of checking the status of the transaction
     - kv_commit:The command of committing the data written by prewrite command
-    - kv_cleanup:The command of rolling back a transaction, it will abolished in 4.0
+    - kv_cleanup:The command of rolling back a transaction, which is deprecated in v4.0
     - kv_batch_get:The command of getting the value of batch key at once, similar to `kv_get`.
     - kv_batch_rollback:The command of batch rollback of multiple prewrite transaction
     - kv_scan_lock:The command of scanning all locks with a version number before `max_version` to clean up expired transactions
@@ -413,7 +413,7 @@ This document provides a detailed description of these key metrics on the **TiKV
     - raw_get:The command of getting the value of key
     - raw_batch_get:The command of getting the value of batch keys
     - raw_scan:The command of scanning a range of data
-    - raw_batch_scan:The command of scanning multiple consecutive data
+    - raw_batch_scan:The command of scanning multiple consecutive data range
     - raw_put:The command of writing a key/value pair
     - raw_batch_put:The command of writing a batch of key/value pairs
     - raw_delete:The command of deleting a key/value pair

From b672b2a2d356ae0ddd84333324027c60f482bacb Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Tue, 14 Jul 2020 14:34:07 +0800
Subject: [PATCH 10/13] Update grafana-tikv-dashboard.md

---
 grafana-tikv-dashboard.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index cc15ae462435b..4d801c3b589fe 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -11,7 +11,7 @@ If you use TiUP to deploy the TiDB cluster, the monitoring system (Prometheus/Gr
 
 The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node_exporter, and so on. A lot of metrics are there to help you diagnose.
 
-You can get an overview of the component TiKV status from the **TiKV-Details** dashboard, where the key metrics are displayed. According to the [Performance Map](https://asktug.com/_/tidb-performance-map/#/), you can check whether the status of the cluster is as expected.
+You can get an overview of the component TiKV status from the **TiKV-Details** dashboard, where the key metrics are displayed.
 
 This document provides a detailed description of these key metrics on the **TiKV-Details** dashboard.
 

From 0546baf3e0d33c5c1621b814be6574db4ab83749 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Tue, 14 Jul 2020 15:25:40 +0800
Subject: [PATCH 11/13] fix typos and refine format

---
 grafana-tikv-dashboard.md | 552 +++++++++++++++++++-------------------
 1 file changed, 276 insertions(+), 276 deletions(-)

diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 4d801c3b589fe..5b75e3107b93f 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -18,43 +18,43 @@ This document provides a detailed description of these key metrics on the **TiKV
 ## Cluster
 
 - Store size: The storage size per TiKV instance
-- Available size:The available capacity per TiKV instance
-- Capacity size:The capacity size per TiKV instance
-- CPU:The CPU usage per TiKV instance
-- Memory:The memory usage per TiKV instance
-- IO utilization:The I/O utilization per TiKV instance
-- MBps:The total bytes of read and write in each TiKV instance
-- QPS:The QPS per command in each TiKV instance
-- Errps:The rate of gRPC message failures
-- leader:The number of leaders per TiKV instance
-- Region:The number of Regions per TiKV instance
-- Uptime:The runtime of TiKV since last restart
+- Available size: The available capacity per TiKV instance
+- Capacity size: The capacity size per TiKV instance
+- CPU: The CPU utilization per TiKV instance
+- Memory: The memory usage per TiKV instance
+- IO utilization: The I/O utilization per TiKV instance
+- MBps: The total bytes of read and write in each TiKV instance
+- QPS: The QPS per command in each TiKV instance
+- Errps: The rate of gRPC message failures
+- leader: The number of leaders per TiKV instance
+- Region: The number of Regions per TiKV instance
+- Uptime: The runtime of TiKV since last restart
 
 ![TiKV Dashboard - Cluster metrics](/media/tikv-dashboard-cluster.png)
 
 ## Errors
 
-- Critical error:The number of critical errors
-- Server is busy:Indicates occurrences of events that make the TiKV instance unavailable temporarily, such as Write Stall, Channel Full, and so on. It should be `0` in normal case.
-- Server report failures:The number of error messages reported by server. It should be `0` in normal case.
-- Raftstore error:The number of Raftstore errors per type on each TiKV instance
-- Scheduler error:The number of scheduler errors per type on each TiKV instance
-- Coprocessor error:The number of coprocessor errors per type on each TiKV instance
-- gRPC message error:The number of gRPC message errors per type on each TiKV instance
-- Leader drop:The count of dropped leaders per TiKV instance
-- Leader missing:The count of missing leaders per TiKV instance
+- Critical error: The number of critical errors
+- Server is busy: Indicates occurrences of events that make the TiKV instance unavailable temporarily, such as Write Stall, Channel Full, and so on. It should be `0` in normal case.
+- Server report failures: The number of error messages reported by server. It should be `0` in normal case.
+- Raftstore error: The number of Raftstore errors per type on each TiKV instance
+- Scheduler error: The number of scheduler errors per type on each TiKV instance
+- Coprocessor error: The number of coprocessor errors per type on each TiKV instance
+- gRPC message error: The number of gRPC message errors per type on each TiKV instance
+- Leader drop: The count of dropped leaders per TiKV instance
+- Leader missing: The count of missing leaders per TiKV instance
 
 ![TiKV Dashboard - Errors metrics](/media/tikv-dashboard-errors.png)
 
 ## Server
 
-- CF size:The size of each column family
-- Store size:The storage size per TiKV instance
-- Channel full:The number of Channel Full errors per TiKV instance. It should be `0` in normal case.
-- Active written leaders:The number of leaders being written on each TiKV instance
-- Approximate Region size:The approximate Region size
-- Approximate Region size Histogram:The histogram of each approximate Region size
-- Region average written keys:The average number of written keys to Regions per TiKV instance
+- CF size: The size of each column family
+- Store size: The storage size per TiKV instance
+- Channel full: The number of Channel Full errors per TiKV instance. It should be `0` in normal case.
+- Active written leaders: The number of leaders being written on each TiKV instance
+- Approximate Region size: The approximate Region size
+- Approximate Region size Histogram: The histogram of each approximate Region size
+- Region average written keys: The average number of written keys to Regions per TiKV instance
 - Region average written bytes: The average written bytes to Regions per TiKV instance
 
 ![TiKV Dashboard - Server metrics](/media/tikv-dashboard-server.png)
@@ -62,360 +62,360 @@ This document provides a detailed description of these key metrics on the **TiKV
 ## gRPC
 
 - gRPC message count: The rate of gRPC messages per type
-- gRPC message failed:The rate of failed gRPC messages
+- gRPC message failed: The rate of failed gRPC messages
 - 99% gRPC message duration: The gRPC message duration per message type (P99)
 - Average gRPC message duration: The average execution time of gRPC messages
-- gRPC batch size:The batch size of gRPC messages between TiDB and TiKV
-- Raft message batch size:The batch size of Raft messages between TiKV instances
+- gRPC batch size: The batch size of gRPC messages between TiDB and TiKV
+- Raft message batch size: The batch size of Raft messages between TiKV instances
 
 ## Thread CPU
 
-- Raft store CPU:The CPU utilization of the `raftstore` thread. The CPU usage should be less than 80% * `raftstore.store-pool-size` in normal case.
-- Async apply CPU:The CPU utilization of the `async apply` thread. The CPU usage should be less than 90% * `raftstore.apply-pool-size` in normal cases.
-- Scheduler worker CPU:The CPU utilization of the `scheduler worker` thread. The CPU usage should be less than 90% * `storage.scheduler-worker-pool-size` in normal cases.
-- gRPC poll CPU:The CPU utilization of the `gRPC` thread. The CPU usage should be less than 80% * `server.grpc-concurrency` in normal cases.
-- Unified read pool CPU:The CPU utilization of `unified read pool` thread
-- Storage ReadPool CPU:The CPU utilization of `storage read pool` thread
-- Coprocessor CPU:The CPU utilization of `coprocessor` thread
-- RocksDB CPU:The CPU utilization of RocksDB thread
-- Split check CPU:The CPU utilization of `split check` thread
-- GC worker CPU:The CPU utilization of `GC worker` thread
-- Snapshot worker CPU:The CPU utilization of `snapshot worker` thread
+- Raft store CPU: The CPU utilization of the `raftstore` thread. The CPU utilization should be less than 80% * `raftstore.store-pool-size` in normal case.
+- Async apply CPU: The CPU utilization of the `async apply` thread. The CPU utilization should be less than 90% * `raftstore.apply-pool-size` in normal cases.
+- Scheduler worker CPU: The CPU utilization of the `scheduler worker` thread. The CPU utilization should be less than 90% * `storage.scheduler-worker-pool-size` in normal cases.
+- gRPC poll CPU: The CPU utilization of the `gRPC` thread. The CPU utilization should be less than 80% * `server.grpc-concurrency` in normal cases.
+- Unified read pool CPU: The CPU utilization of the `unified read pool` thread
+- Storage ReadPool CPU: The CPU utilization of the `storage read pool` thread
+- Coprocessor CPU: The CPU utilization of the `coprocessor` thread
+- RocksDB CPU: The CPU utilization of the RocksDB thread
+- Split check CPU: The CPU utilization of the `split check` thread
+- GC worker CPU: The CPU utilization of the `GC worker` thread
+- Snapshot worker CPU: The CPU utilization of the `snapshot worker` thread
 
 ## PD
 
-- PD requests:The rate of requests that TiKV sends to PD
-- PD request duration (average):The average time consumed by requests that TiKV sends to PD
-- PD heartbeats:The rate of heartbeat messages sended from TiKV to PD
-- PD validate peers:The rate of messages that sended from TiKV to PD to validate peer
+- PD requests: The rate at which TiKV sends to PD
+- PD request duration (average): The average duration of processing requests that TiKV sends to PD
+- PD heartbeats: The rate at which heartbeat messages are sent from TiKV to PD
+- PD validate peers: The rate at which messages are sent from TiKV to PD to validate TiKV peers
 
 ## Raft IO
 
-- Apply log duration:The time consumed for Raft to apply logs
-- Apply log duration per server:The time consumed for Raft to apply logs per TiKV instance
-- Append log duration:The time consumed for Raft to append logs
-- Append log duration per server:The time consumed for Raft to append logs per TiKV instance
-- Commit log duration:The time consumed for Raft to commit logs
-- Commit log duration per server:The time consumed for Raft to commit logs per TiKV instance
+- Apply log duration: The time consumed for Raft to apply logs
+- Apply log duration per server: The time consumed for Raft to apply logs per TiKV instance
+- Append log duration: The time consumed for Raft to append logs
+- Append log duration per server: The time consumed for Raft to append logs per TiKV instance
+- Commit log duration: The time consumed by Raft to commit logs
+- Commit log duration per server: The time consumed by Raft to commit logs per TiKV instance
 
 ![TiKV Dashboard - Raft IO metrics](/media/tikv-dashboard-raftio.png)
 
 ## Raft process
 
-- Ready handled:The count of handled ready operations per second
-- 0.99 Duration of Raft store events:The time consumed by raftstore events (P99)
-- Process ready duration:The time consumed for processes to be ready in Raft
-- Process ready duration per server:The time consumed for peer processes to be ready in Raft per TiKV instance. It should be less than 2 seconds (P99.99).
+- Ready handled: The count of handled ready operations per second
+- 0.99 Duration of Raft store events: The time consumed by Raftstore events (P99)
+- Process ready duration: The time consumed for processes to be ready in Raft
+- Process ready duration per server: The time consumed for peer processes to be ready in Raft per TiKV instance. It should be less than 2 seconds (P99.99).
 
 ![TiKV Dashboard - Raft process metrics](/media/tikv-dashboard-raft-process.png)
 
 ## Raft message
 
-- Sent messages per server:The number of Raft messages sent by each TiKV instance per second
-- Flush messages per server:The number of Raft messages flushed by the Raft client in each TiKV instance per second
-- Receive messages per server:The number of Raft messages received by each TiKV instance per second
-- Messages:The number of Raft messages sent per type per second
-- Vote:The number of Vote messages sent in Raft per second
-- Raft dropped messages:The number of dropped Raft messages per type per second
+- Sent messages per server: The number of Raft messages sent by each TiKV instance per second
+- Flush messages per server: The number of Raft messages flushed by the Raft client in each TiKV instance per second
+- Receive messages per server: The number of Raft messages received by each TiKV instance per second
+- Messages: The number of Raft messages sent per type per second
+- Vote: The number of Vote messages sent in Raft per second
+- Raft dropped messages: The number of dropped Raft messages per type per second
 
 ![TiKV Dashboard - Raft message metrics](/media/tikv-dashboard-raft-message.png)
 
 ## Raft propose
 
-- Raft apply proposals per ready:The histogram of the number of proposals that each ready operation containes in a batch while applying proposal.
-- Raft read/write proposals:The number of proposals per type per second
-- Raft read proposals per server:The number of read proposals made by each TiKV instance per second
-- Raft write proposals per server:The number of write proposals made by each TiKV instance per second
-- Propose wait duration:The histogram of wait time of each proposal
-- Propose wait duration per server:The histogram of wait time of each proposal per TiKV instance
-- Apply wait duration:The histogram of apply time of each proposal
-- Apply wait duration per server:The histogram of apply time of each proposal per TiKV instance
-- Raft log speed:The average rate at which peers propose logs
+- Raft apply proposals per ready: The histogram of the number of proposals that each ready operation contains in a batch while applying proposal.
+- Raft read/write proposals: The number of proposals per type per second +- Raft read proposals per server: The number of read proposals made by each TiKV instance per second +- Raft write proposals per server: The number of write proposals made by each TiKV instance per second +- Propose wait duration: The histogram of waiting time of each proposal +- Propose wait duration per server: The histogram of waiting time of each proposal per TiKV instance +- Apply wait duration: The histogram of apply time of each proposal +- Apply wait duration per server: The histogram of apply time of each proposal per TiKV instance +- Raft log speed: The average rate at which peers propose logs ![TiKV Dashboard - Raft propose metrics](/media/tikv-dashboard-raft-propose.png) ## Raft admin -- Admin proposals:The number of admin proposals per second -- Admin apply:The number of processed apply commands per second -- Check split:The number of raftstore split check commands per second -- 99.99% Check split duration:The time consumed when running split check commands (P99.99) +- Admin proposals: The number of admin proposals per second +- Admin apply: The number of processed apply commands per second +- Check split: The number of Raftstore split check commands per second +- 99.99% Check split duration: The time consumed when running split check commands (P99.99) ![TiKV Dashboard - Raft admin metrics](/media/tikv-dashboard-raft-admin.png) ## Local reader -- Local reader requests:The number of total requests and the number of rejections from the local read thread +- Local reader requests: The number of total requests and the number of rejections from the local read thread ![TiKV Dashboard - Local reader metrics](/media/tikv-dashboard-local-reader.png) ## Unified Read Pool -- Time used by level:The time consumed for each level in the unified read pool. Level 0 means small queries. 
-- Level 0 chance:The proportion of level 0 tasks in unified read pool -- Running tasks:The number of tasks running concurrently in the unified read pool +- Time used by level: The time consumed for each level in the unified read pool. Level 0 means small queries. +- Level 0 chance: The proportion of level 0 tasks in the unified read pool +- Running tasks: The number of tasks running concurrently in the unified read pool ## Storage -- Storage command total:The number of received command by type per second -- Storage async request error:The number of engine asynchronous request errors per second -- Storage async snapshot duration:The time consumed by processing asynchronous snapshot requests. It should be less than `1s` in `.99`. -- Storage async write duration:The time consumed by processing asynchronous write requests. It should be less than `1s` in `.99`. +- Storage command total: The number of received commands by type per second +- Storage async request error: The number of engine asynchronous request errors per second +- Storage async snapshot duration: The time consumed by processing asynchronous snapshot requests. It should be less than `1s` in `.99`. +- Storage async write duration: The time consumed by processing asynchronous write requests. It should be less than `1s` in `.99`. ![TiKV Dashboard - Storage metrics](/media/tikv-dashboard-storage.png) ## Scheduler -- Scheduler stage total:The number of commands at each stage per second. There should not be lots of errors in a short time. -- Scheduler writing bytes:The total written bytes of commands processed by each TiKV instance -- Scheduler priority commands:The count of different priority commands per second -- Scheduler pending commands:The count of pending commands per TiKV instance per second +- Scheduler stage total: The number of commands at each stage per second. There should not be a lot of errors in a short time. 
+- Scheduler writing bytes: The total written bytes by commands processed on each TiKV instance +- Scheduler priority commands: The count of different priority commands per second +- Scheduler pending commands: The count of pending commands per TiKV instance per second ![TiKV Dashboard - Scheduler metrics](/media/tikv-dashboard-scheduler.png) ## Scheduler - commit -- Scheduler stage total:The number of commands at each stage per second when executing the commit command. There should not be lots of errors in a short time. -- Scheduler command duration:The time consumed when executing the commit command. It should be less than `1s`. -- Scheduler latch wait duration:The wait time caused by latch when executing the commit command. It should be less than `1s`. -- Scheduler keys read:The count of keys read by a commit command -- Scheduler keys written:The count of keys written by a commit command -- Scheduler scan details:The keys scan details of each CF when executing the commit command. -- Scheduler scan details [lock]:The keys scan details of lock CF when executing the commit command -- Scheduler scan details [write]:The keys scan details of write CF when executing the commit command -- Scheduler scan details [default]:The keys scan details of default CF when executing the commit command +- Scheduler stage total: The number of commands at each stage per second when executing the commit command. There should not be a lot of errors in a short time. +- Scheduler command duration: The time consumed when executing the commit command. It should be less than `1s`. +- Scheduler latch wait duration: The waiting time caused by latch when executing the commit command. It should be less than `1s`. +- Scheduler keys read: The count of keys read by a commit command +- Scheduler keys written: The count of keys written by a commit command +- Scheduler scan details: The keys scan details of each CF when executing the commit command. 
+- Scheduler scan details [lock]: The keys scan details of lock CF when executing the commit command +- Scheduler scan details [write]: The keys scan details of write CF when executing the commit command +- Scheduler scan details [default]: The keys scan details of default CF when executing the commit command ![TiKV Dashboard - Scheduler commit metrics](/media/tikv-dashboard-scheduler-commit.png) ## Scheduler - pessimistic_rollback -- Scheduler stage total:The number of commands at each stage per second when executing the pessimistic_rollback command. There should not be lots of errors in a short time. -- Scheduler command duration:The time consumed when executing the pessimistic_rollback command. It should be less than `1s`. -- Scheduler latch wait duration:The wait time caused by latch when executing the pessimistic_rollback command. It should be less than `1s`. -- Scheduler keys read:The count of keys read by a pessimistic_rollback command -- Scheduler keys written:The count of keys written by a pessimistic_rollback command -- Scheduler scan details:The keys scan details of each CF when executing the pessimistic_rollback command. -- Scheduler scan details [lock]:The keys scan details of lock CF when executing the pessimistic_rollback command -- Scheduler scan details [write]:The keys scan details of write CF when executing the pessimistic_rollback command -- Scheduler scan details [default]:The keys scan details of default CF when executing the pessimistic_rollback command +- Scheduler stage total: The number of commands at each stage per second when executing the `pessimistic_rollback` command. There should not be a lot of errors in a short time. +- Scheduler command duration: The time consumed when executing the `pessimistic_rollback` command. It should be less than `1s`. +- Scheduler latch wait duration: The waiting time caused by latch when executing the `pessimistic_rollback` command. It should be less than `1s`. 
+- Scheduler keys read: The count of keys read by a `pessimistic_rollback` command +- Scheduler keys written: The count of keys written by a `pessimistic_rollback` command +- Scheduler scan details: The keys scan details of each CF when executing the `pessimistic_rollback` command. +- Scheduler scan details [lock]: The keys scan details of lock CF when executing the `pessimistic_rollback` command +- Scheduler scan details [write]: The keys scan details of write CF when executing the `pessimistic_rollback` command +- Scheduler scan details [default]: The keys scan details of default CF when executing the `pessimistic_rollback` command ## Scheduler - prewrite -- Scheduler stage total:The number of commands at each stage per second when executing the prewrite command. There should not be lots of errors in a short time. -- Scheduler command duration:The time consumed when executing the prewrite command. It should be less than `1s`. -- Scheduler latch wait duration:The wait time caused by latch when executing the prewrite command. It should be less than `1s`. -- Scheduler keys read:The count of keys read by a prewrite command -- Scheduler keys written:The count of keys written by a prewrite command -- Scheduler scan details:The keys scan details of each CF when executing the prewrite command. -- Scheduler scan details [lock]:The keys scan details of lock CF when executing the prewrite command -- Scheduler scan details [write]:The keys scan details of write CF when executing the prewrite command -- Scheduler scan details [default]:The keys scan details of default CF when executing the prewrite command +- Scheduler stage total: The number of commands at each stage per second when executing the prewrite command. There should not be a lot of errors in a short time. +- Scheduler command duration: The time consumed when executing the prewrite command. It should be less than `1s`. 
+- Scheduler latch wait duration: The waiting time caused by latch when executing the prewrite command. It should be less than `1s`. +- Scheduler keys read: The count of keys read by a prewrite command +- Scheduler keys written: The count of keys written by a prewrite command +- Scheduler scan details: The keys scan details of each CF when executing the prewrite command. +- Scheduler scan details [lock]: The keys scan details of lock CF when executing the prewrite command +- Scheduler scan details [write]: The keys scan details of write CF when executing the prewrite command +- Scheduler scan details [default]: The keys scan details of default CF when executing the prewrite command ## Scheduler - rollback -- Scheduler stage total:The number of commands at each stage per second when executing the rollback command. There should not be lots of errors in a short time. -- Scheduler command duration:The time consumed when executing the rollback command. It should be less than `1s`. -- Scheduler latch wait duration:The wait time caused by latch when executing the rollback command. It should be less than `1s`. -- Scheduler keys read:The count of keys read by a rollback command -- Scheduler keys written:The count of keys written by a rollback command -- Scheduler scan details:The keys scan details of each CF when executing the rollback command. -- Scheduler scan details [lock]:The keys scan details of lock CF when executing the rollback command -- Scheduler scan details [write]:The keys scan details of write CF when executing the rollback command -- Scheduler scan details [default]:The keys scan details of default CF when executing the rollback command +- Scheduler stage total: The number of commands at each stage per second when executing the rollback command. There should not be a lot of errors in a short time. +- Scheduler command duration: The time consumed when executing the rollback command. It should be less than `1s`. 
+- Scheduler latch wait duration: The waiting time caused by latch when executing the rollback command. It should be less than `1s`. +- Scheduler keys read: The count of keys read by a rollback command +- Scheduler keys written: The count of keys written by a rollback command +- Scheduler scan details: The keys scan details of each CF when executing the rollback command. +- Scheduler scan details [lock]: The keys scan details of lock CF when executing the rollback command +- Scheduler scan details [write]: The keys scan details of write CF when executing the rollback command +- Scheduler scan details [default]: The keys scan details of default CF when executing the rollback command ## GC -- MVCC versions:The number of versions for each key -- MVCC delete versions:The number of versions deleted by GC for each key -- GC tasks:The count of GC tasks processed by gc_worker -- GC tasks Duration:The time consumed when executing GC tasks -- GC keys (write CF):The count of keys in write CF affected during GC -- TiDB GC worker actions:The count of TiDB GC worker actions -- TiDB GC seconds:The GC duration -- GC speed:The number of keys deleted by GC per second -- TiKV AutoGC Working:The status of Auto GC -- ResolveLocks Progress:The progress of the first phase of GC(Resolve Locks) -- TiKV Auto GC Progress:The progress of the second phase of GC -- TiKV Auto GC SafePoint:The value of TiKV GC safe point. 
The safe point is the current GC timestamp -- GC lifetime:The lifetime of TiDB GC -- GC interval:The interval of TiDB GC +- MVCC versions: The number of versions for each key +- MVCC delete versions: The number of versions deleted by GC for each key +- GC tasks: The count of GC tasks processed by gc_worker +- GC tasks Duration: The time consumed when executing GC tasks +- GC keys (write CF): The count of keys in write CF affected during GC +- TiDB GC worker actions: The count of TiDB GC worker actions +- TiDB GC seconds: The GC duration +- GC speed: The number of keys deleted by GC per second +- TiKV AutoGC Working: The status of Auto GC +- ResolveLocks Progress: The progress of the first phase of GC (Resolve Locks) +- TiKV Auto GC Progress: The progress of the second phase of GC +- TiKV Auto GC SafePoint: The value of TiKV GC safe point. The safe point is the current GC timestamp +- GC lifetime: The lifetime of TiDB GC +- GC interval: The interval of TiDB GC ## Snapshot -- Rate snapshot message:The rate at which Raft snapshot messages are sent -- 99% Handle snapshot duration:The time consumed to handle snapshots (P99) -- Snapshot state count:The number of snapshots per state -- 99.99% Snapshot size:The snapshot size (P99.99) -- 99.99% Snapshot KV count:The number of KV within a snapshot (P99.99) +- Rate snapshot message: The rate at which Raft snapshot messages are sent +- 99% Handle snapshot duration: The time consumed to handle snapshots (P99) +- Snapshot state count: The number of snapshots per state +- 99.99% Snapshot size: The snapshot size (P99.99) +- 99.99% Snapshot KV count: The number of KV within a snapshot (P99.99) ## Task -- Worker handled tasks:The number of tasks handled by worker per second -- Worker pending tasks:Current number of pending and running tasks of worker per second. It should be less than `1000` in normal case. 
-- FuturePool handled tasks:The number of tasks handled by future pool per second -- FuturePool pending tasks:Current number of pending and running tasks of future pool per second +- Worker handled tasks: The number of tasks handled by worker per second +- Worker pending tasks: Current number of pending and running tasks of worker per second. It should be less than `1000` in normal case. +- FuturePool handled tasks: The number of tasks handled by future pool per second +- FuturePool pending tasks: Current number of pending and running tasks of future pool per second ## Coprocessor Overview -- Request duration:The total time spent from receiving the coprocessor request to the end of request processing -- Total Requests:The number of requests by type per second -- Handle duration:The histogram of time spent actually processing coprocessor requests per minute -- Total Request Errors:The number of request errors of Coprocessor. There should not be lots of errors in a short time. -- Total KV Cursor Operations:The total number of the KV cursor operations by type per second, such as select, index, analyze_table, analyze_index, checksum_table, checksum_index, and so on. -- KV Cursor Operations:The histogram of KV cursor operations by type per second -- Total RocksDB Perf Statistics:The performance statistics of RocksDB -- Total Response Size:The total size of coprocessor response +- Request duration: The total duration from the time of receiving the coprocessor request to the time of finishing processing the request +- Total Requests: The number of requests by type per second +- Handle duration: The histogram of time spent actually processing coprocessor requests per minute +- Total Request Errors: The number of request errors of Coprocessor per second. There should not be a lot of errors in a short time. 
+- Total KV Cursor Operations: The total number of the KV cursor operations by type per second, such as `select`, `index`, `analyze_table`, `analyze_index`, `checksum_table`, `checksum_index`, and so on. +- KV Cursor Operations: The histogram of KV cursor operations by type per second +- Total RocksDB Perf Statistics: The statistics of RocksDB performance +- Total Response Size: The total size of coprocessor response ## Coprocessor Detail -- Handle duration:The histogram of time spent actually processing coprocessor requests per minute -- 95% Handle duration by store:The time consumed to handle coprocessor requests per TiKV instance per second (P95) -- Wait duration:The time consumed when coprocessor requests are waiting to be handled. It should be less than `10s` (P99.99). -- 95% Wait duration by store:The time consumed when coprocessor requests are waiting to be handled per TiKV instance per second (P95) -- Total DAG Requests:The total number of DAG requests per second -- Total DAG Executors:The total number of DAG executors per second -- Total Ops Details (Table Scan):The number of RocksDB internal operations per second when executing select scan in coprocessor -- Total Ops Details (Index Scan):The number of RocksDB internal operations per second when executing index scan in coprocessor -- Total Ops Details by CF (Table Scan):The number of RocksDB internal operations for each CF per second when executing select scan in coprocessor -- Total Ops Details by CF (Index Scan):The number of RocksDB internal operations for each CF per second when executing index scan in coprocessor +- Handle duration: The histogram of time spent actually processing coprocessor requests per minute +- 95% Handle duration by store: The time consumed to handle coprocessor requests per TiKV instance per second (P95) +- Wait duration: The time consumed when coprocessor requests are waiting to be handled. It should be less than `10s` (P99.99). 
+- 95% Wait duration by store: The time consumed when coprocessor requests are waiting to be handled per TiKV instance per second (P95) +- Total DAG Requests: The total number of DAG requests per second +- Total DAG Executors: The total number of DAG executors per second +- Total Ops Details (Table Scan): The number of RocksDB internal operations per second when executing select scan in coprocessor +- Total Ops Details (Index Scan): The number of RocksDB internal operations per second when executing index scan in coprocessor +- Total Ops Details by CF (Table Scan): The number of RocksDB internal operations for each CF per second when executing select scan in coprocessor +- Total Ops Details by CF (Index Scan): The number of RocksDB internal operations for each CF per second when executing index scan in coprocessor ## Threads -- Threads state:The state of TiKV threads -- Threads IO:The I/O traffic of each TiKV thread -- Thread Voluntary Context Switches:The number of TiKV threads voluntary context switches -- Thread Nonvoluntary Context Switches:The number of TiKV threads nonvoluntary context switches +- Threads state: The state of TiKV threads +- Threads IO: The I/O traffic of each TiKV thread +- Thread Voluntary Context Switches: The number of TiKV threads voluntary context switches +- Thread Nonvoluntary Context Switches: The number of TiKV threads nonvoluntary context switches ## RocksDB - kv/raft -- Get operations:The count of get operations per second -- Get duration:The time consumed when executing get operations -- Seek operations:The count of seek operations per second -- Seek duration:The time consumed when executing seek operations -- Write operations:The count of write operations per second -- Write duration:The time consumed when executing write operations -- WAL sync operations:The count of WAL sync operations per second -- Write WAL duration:The time consumed for writing WAL -- WAL sync duration:The time consumed when executing WAL sync operations +- 
Get operations: The count of get operations per second +- Get duration: The time consumed when executing get operations +- Seek operations: The count of seek operations per second +- Seek duration: The time consumed when executing seek operations +- Write operations: The count of write operations per second +- Write duration: The time consumed when executing write operations +- WAL sync operations: The count of WAL sync operations per second +- Write WAL duration: The time consumed for writing WAL +- WAL sync duration: The time consumed when executing WAL sync operations - Compaction operations: The count of compaction and flush operations per second -- Compaction duration:The time consumed when executing the compaction and flush operations -- SST read duration:The time consumed when reading SST files -- Write stall duration:Write stall duration. It should be `0` in normal case. -- Memtable size:The memtable size of each column family -- Memtable hit:The hit rate of memtable -- Block cache size:The block cache size. Broken down by column family if shared block cache is disabled. -- Block cache hit:The hit rate of block cache -- Block cache flow:The flow rate of block cache operations per type +- Compaction duration: The time consumed when executing the compaction and flush operations +- SST read duration: The time consumed when reading SST files +- Write stall duration: Write stall duration. It should be `0` in normal case. +- Memtable size: The memtable size of each column family +- Memtable hit: The hit rate of memtable +- Block cache size: The block cache size. Broken down by column family if shared block cache is disabled. 
+- Block cache hit: The hit rate of block cache +- Block cache flow: The flow rate of block cache operations per type - Block cache operations: The count of block cache operations per type -- Keys flow:The flow rate of operations on keys per type -- Total keys:The count of keys in each column family -- Read flow:The flow rate of read operations per type -- Bytes / Read:The bytes per read operation -- Write flow:The flow rate of write operations per type -- Bytes / Write:The bytes per write operation -- Compaction flow:The flow rate of compaction operations per type -- Compaction pending bytes:The pending bytes to be compacted -- Read amplification:The read amplification per TiKV instance -- Compression ratio:The compression ratio of each level -- Number of snapshots:The number of snapshots per TiKV instance -- Oldest snapshots duration:The time that the oldest unreleased snapshot survivals -- Number files at each level:The number of SST files for different column families in each level -- Ingest SST duration seconds:The time consumed to ingest SST files -- Stall conditions changed of each CF:Stall conditions changed of each column family +- Keys flow: The flow rate of operations on keys per type +- Total keys: The count of keys in each column family +- Read flow: The flow rate of read operations per type +- Bytes / Read: The bytes per read operation +- Write flow: The flow rate of write operations per type +- Bytes / Write: The bytes per write operation +- Compaction flow: The flow rate of compaction operations per type +- Compaction pending bytes: The pending bytes to be compacted +- Read amplification: The read amplification per TiKV instance +- Compression ratio: The compression ratio of each level +- Number of snapshots: The number of snapshots per TiKV instance +- Oldest snapshots duration: The time that the oldest unreleased snapshot survives +- Number files at each level: The number of SST files for different column families in each level +- Ingest SST 
duration seconds: The time consumed to ingest SST files +- Stall conditions changed of each CF: Stall conditions changed of each column family ## Titan - All -- Blob file count:The number of Titan blob files -- Blob file size:The total size of Titan blob file -- Live blob size:The total size of valid blob record -- Blob cache hit:The hit rate of Titan block cache -- Iter touched blob file count:The number of blob file involved in a single iterator -- Blob file discardable ratio distribution:The ratio distribution of blob record failure of blob files -- Blob key size:The size of Titan blob keys -- Blob value size:The size of Titan blob values -- Blob get operations:The count of get operations in Titan blob -- Blob get duration:The time consumed when executing get operations in Titan blob -- Blob iter operations:The time consumed when executing iter operations in Titan blob -- Blob seek duration:The time consumed when executing seek operations in Titan blob -- Blob next duration:The time consumed when executing next operations in Titan blob -- Blob prev duration:The time consumed when executing prev operations in Titan blob -- Blob keys flow:The flow rate of operations on Titan blob keys -- Blob bytes flow:The flow rate of bytes on Titan blob keys -- Blob file read duration:The time consumed when reading Titan blob file -- Blob file write duration:The time consumed when writing Titan blob file -- Blob file sync operations:The count of blob file sync operations -- Blob file sync duration:The time consumed when synchronizing blob file -- Blob GC action:The count of Titan GC actions -- Blob GC duration:The Titan GC duration -- Blob GC keys flow:The flow rate of keys read and written by Titan GC -- Blob GC bytes flow:The flow rate of bytes read and written by Titan GC -- Blob GC input file size:The size of Titan GC input file -- Blob GC output file size:The size of Titan GC output file -- Blob GC file count:The count of blob files involved in Titan GC +- Blob file count: 
The number of Titan blob files +- Blob file size: The total size of Titan blob file +- Live blob size: The total size of valid blob record +- Blob cache hit: The hit rate of Titan block cache +- Iter touched blob file count: The number of blob file involved in a single iterator +- Blob file discardable ratio distribution: The ratio distribution of blob record failure of blob files +- Blob key size: The size of Titan blob keys +- Blob value size: The size of Titan blob values +- Blob get operations: The count of get operations in Titan blob +- Blob get duration: The time consumed when executing get operations in Titan blob +- Blob iter operations: The time consumed when executing iter operations in Titan blob +- Blob seek duration: The time consumed when executing seek operations in Titan blob +- Blob next duration: The time consumed when executing next operations in Titan blob +- Blob prev duration: The time consumed when executing prev operations in Titan blob +- Blob keys flow: The flow rate of operations on Titan blob keys +- Blob bytes flow: The flow rate of bytes on Titan blob keys +- Blob file read duration: The time consumed when reading Titan blob file +- Blob file write duration: The time consumed when writing Titan blob file +- Blob file sync operations: The count of blob file sync operations +- Blob file sync duration: The time consumed when synchronizing blob file +- Blob GC action: The count of Titan GC actions +- Blob GC duration: The Titan GC duration +- Blob GC keys flow: The flow rate of keys read and written by Titan GC +- Blob GC bytes flow: The flow rate of bytes read and written by Titan GC +- Blob GC input file size: The size of Titan GC input file +- Blob GC output file size: The size of Titan GC output file +- Blob GC file count: The count of blob files involved in Titan GC ## Lock manager -- Thread CPU:The CPU utilization of the lock manager thread -- Handled tasks:The number of taks handled by lock manager -- Waiter lifetime duration:The 
waiting time of the transaction for the lock to be released -- Wait table:The status information of wait table, including the number of locks and the number of transactions waiting for the lock -- Deadlock detect duration:The time consumed for detecting deadlock -- Detect error:The number of errors encountered when detecting deadlock, including the number of deadlocks -- Deadlock detector leader:The information of the node where the deadlock detector leader is located +- Thread CPU: The CPU utilization of the lock manager thread +- Handled tasks: The number of tasks handled by lock manager +- Waiter lifetime duration: The waiting time of the transaction for the lock to be released +- Wait table: The status information of wait table, including the number of locks and the number of transactions waiting for the lock +- Deadlock detect duration: The time consumed for detecting deadlock +- Detect error: The number of errors encountered when detecting deadlock, including the number of deadlocks +- Deadlock detector leader: The information of the node where the deadlock detector leader is located ## Memory -- Allocator Stats:The statistics of the memory allocator +- Allocator Stats: The statistics of the memory allocator ## Backup -- Backup CPU:The CPU utilization of the backup thread -- Range Size:The histogram of backup range size -- Backup Duration:The time consumed for backup -- Backup Flow:The total bytes of backup -- Disk Throughput:The disk throughput per instance -- Backup Range Duration:The time consumed for backing up a range -- Backup Errors:The number of errors encountered during a backup +- Backup CPU: The CPU utilization of the backup thread +- Range Size: The histogram of backup range size +- Backup Duration: The time consumed for backup +- Backup Flow: The total bytes of backup +- Disk Throughput: The disk throughput per instance +- Backup Range Duration: The time consumed for backing up a range +- Backup Errors: The number of errors encountered during a 
backup ## Encryption -- Encryption data keys:The total number of encrypted data keys -- Encrypted files:The number of encrypted files -- Encryption initialized:Shows whether encryption is enabled, `1` means enabled. -- Encryption meta files size:The size of the encryption meta file -- Encrypt/decrypt data nanos:The histogram of duration on encrypting/decrypting data each time -- Read/write encryption meta duration:The time consumed for reading/writing encryption meta file +- Encryption data keys: The total number of encrypted data keys +- Encrypted files: The number of encrypted files +- Encryption initialized: Shows whether encryption is enabled. `1` means enabled. +- Encryption meta files size: The size of the encryption meta file +- Encrypt/decrypt data nanos: The histogram of duration on encrypting/decrypting data each time +- Read/write encryption meta duration: The time consumed for reading/writing encryption meta files ## Explanation of Common Parameters ### gRPC Message Type -1. Transactional API: - - - kv_get:The command of getting the latest version of data specified by ts - - kv_scan:The command of scanning a range of data - - kv_prewrite:The command of prewriting the data to be committed at first phase of 2PC - - kv_pessimistic_lock:The command of adding a pessimistic lock to the key to prevent other transaction from modifying this key - - kv_pessimistic_rollback:The command of deleting the pessimistic lock on the key - - kv_txn_heart_beat:The command of updating `lock_ttl` for pessimistic transactions or large transactions to prevent them from rolling back - - kv_check_txn_status:The command of checking the status of the transaction - - kv_commit:The command of committing the data written by prewrite command - - kv_cleanup:The command of rolling back a transaction, which is deprecated in v4.0 - - kv_batch_get:The command of getting the value of batch key at once, similar to `kv_get`. 
- - kv_batch_rollback:The command of batch rollback of multiple prewrite transaction - - kv_scan_lock:The command of scanning all locks with a version number before `max_version` to clean up expired transactions - - kv_resolve_lock:The command of committing or rollback the transaction lock, according to the transaction status. - - kv_gc:The command of GC - - kv_delete_range:The command of deleting a range of data from TiKV - -2. Raw API: - - - raw_get:The command of getting the value of key - - raw_batch_get:The command of getting the value of batch keys - - raw_scan:The command of scanning a range of data - - raw_batch_scan:The command of scanning multiple consecutive data range - - raw_put:The command of writing a key/value pair - - raw_batch_put:The command of writing a batch of key/value pairs - - raw_delete:The command of deleting a key/value pair - - raw_batch_delete:The command of a batch of key/value pairs - - raw_delete_range:The command of deleting a range of data +1. Transactional API: + + - kv_get: The command of getting the latest version of data specified by `ts` + - kv_scan: The command of scanning a range of data + - kv_prewrite: The command of prewriting the data to be committed in the first phase of 2PC + - kv_pessimistic_lock: The command of adding a pessimistic lock to the key to prevent other transactions from modifying this key + - kv_pessimistic_rollback: The command of deleting the pessimistic lock on the key + - kv_txn_heart_beat: The command of updating `lock_ttl` for pessimistic transactions or large transactions to prevent them from rolling back + - kv_check_txn_status: The command of checking the status of the transaction + - kv_commit: The command of committing the data written by the prewrite command + - kv_cleanup: The command of rolling back a transaction, which is deprecated in v4.0 + - kv_batch_get: The command of getting the values of a batch of keys at once, similar to `kv_get` + - kv_batch_rollback: The command of batch rollback of
multiple prewrite transactions + - kv_scan_lock: The command of scanning all locks with a version number before `max_version` to clean up expired transactions + - kv_resolve_lock: The command of committing or rolling back the transaction lock, according to the transaction status + - kv_gc: The command of performing GC + - kv_delete_range: The command of deleting a range of data from TiKV + +2. Raw API: + + - raw_get: The command of getting the value of a key + - raw_batch_get: The command of getting the values of a batch of keys + - raw_scan: The command of scanning a range of data + - raw_batch_scan: The command of scanning multiple consecutive data ranges + - raw_put: The command of writing a key/value pair + - raw_batch_put: The command of writing a batch of key/value pairs + - raw_delete: The command of deleting a key/value pair + - raw_batch_delete: The command of deleting a batch of key/value pairs + - raw_delete_range: The command of deleting a range of data From 3c91884adec1364f0d19992ba929c4221b221dc2 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 14 Jul 2020 15:48:06 +0800 Subject: [PATCH 12/13] Update grafana-tikv-dashboard.md --- grafana-tikv-dashboard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md index 5b75e3107b93f..c31e498d86ee7 100644 --- a/grafana-tikv-dashboard.md +++ b/grafana-tikv-dashboard.md @@ -5,7 +5,7 @@ category: reference aliases: ['/docs/dev/grafana-tikv-dashboard/','/docs/dev/reference/key-monitoring-metrics/tikv-dashboard/'] --- -# Description of TiKV Monitoring Metrics +# Key Monitoring Metrics of TiKV If you use TiUP to deploy the TiDB cluster, the monitoring system (Prometheus/Grafana) is deployed at the same time. For more information, see [Overview of the Monitoring Framework](/tidb-monitoring-framework.md).
From 9b8d7f23eb44b93652d410f2652a4ec460ccd99e Mon Sep 17 00:00:00 2001 From: Win-Man <825895587@qq.com> Date: Sat, 18 Jul 2020 21:57:17 +0800 Subject: [PATCH 13/13] add content --- grafana-tikv-dashboard.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md index c31e498d86ee7..9fb469c24a191 100644 --- a/grafana-tikv-dashboard.md +++ b/grafana-tikv-dashboard.md @@ -11,7 +11,7 @@ If you use TiUP to deploy the TiDB cluster, the monitoring system (Prometheus/Gr The Grafana dashboard is divided into a series of sub dashboards which include Overview, PD, TiDB, TiKV, Node_exporter, and so on. A lot of metrics are there to help you diagnose. -You can get an overview of the component TiKV status from the **TiKV-Details** dashboard, where the key metrics are displayed. +You can get an overview of the component TiKV status from the **TiKV-Details** dashboard, where the key metrics are displayed. According to the [Performance Map](https://asktug.com/_/tidb-performance-map/#/), you can check whether the status of the cluster is as expected. This document provides a detailed description of these key metrics on the **TiKV-Details** dashboard.