From 3331c3668ce2dd569b23dfdcfda0d595f5b6c4b7 Mon Sep 17 00:00:00 2001 From: anotherrachel Date: Fri, 15 Nov 2019 14:15:36 +0800 Subject: [PATCH 1/8] reference: add PD scheduling best practice and glossary --- dev/glossary.md | 67 +++++ dev/reference/best-practices/pd-scheduling.md | 250 ++++++++++++++++++ 2 files changed, 317 insertions(+) create mode 100644 dev/glossary.md create mode 100644 dev/reference/best-practices/pd-scheduling.md diff --git a/dev/glossary.md b/dev/glossary.md new file mode 100644 index 0000000000000..d9c170dd54c60 --- /dev/null +++ b/dev/glossary.md @@ -0,0 +1,67 @@ +--- +title: Glossary +summary: Glossaries about TiDB. +category: glossary +--- + +# Glossary + +## L + +### Leader/Follower/Learner + +Leader/Follower/Learner each corresponds to three roles in a group of [Peers](#regionpeerraft-group). The Leader services all client requests and replicated data to the Followers. If the group Leader fails, one of the Followers will be elected as the new Leader. Learners are non-voting Followers that only synchronizes raft logs, and currently exists briefly in the process of replica addition. + +## O + +### Operator + +An Operator is a collection of actions that applies to a Region and serves a scheduling purpose. For example, "migrate Region 2 Leader to Store 5", "migrate a replica of Region 2 to Store 1, 4, 5". + +An Operator can be computed and generated by a Scheduler, or created by an external API. + +### Operator Step + +An Operator Step is a step in the execution of an Operator. An Operator normally contains multiple Operator steps. 
+ +Currently, available Steps generated by PD include: + +- `TransferLeader`: migrate a Region Leader to a specified Peer +- `AddPeer`: add Followers to a specified Store +- `RemovePeer`: delete a Region Peer +- `AddLearner`: add a Region Learner to a specified Store +- `PromoteLearner`: promote a specified Learner to a voting member +- `SplitRegion`: split a Region in two + +## P + +### `Pending`/`Down` + +`Pending` and `Down` are two special states of Peer. `Pending` indicates that the raft log of Followers or Learners is vastly different from that of Leader, and Followers in `Pending` cannot be elected as Leader. `Down` refers to a state that a Peer ceases to respond to the corresponding Leader for a long time, which usually means that the corresponding node is down or isolated from the network. + +## R + +### Region/Peer/Raft Group + +Each Region maintains a continuous piece of data for the cluster (an average of about 96 MiB by default), each of which stores multiple replicas in different Stores (3 replicas by default) and each replica is referred as a Peer. Multiple Peers of the same Region synchronize data via raft protocol, so Peers also refer to members of a raft instance. TiKV uses the multi-raft pattern to manage data, that is, each Region has a corresponding, standalone raft instance (also known as a Raft Group). + +### Region Split + +Regions in the TiKV cluster are gradually split and generated as the written data accrues. The process of splitting is called Region Split. + +The mechanism of Region Split is to build an initial Region to cover the entire key space in cluster initialization, and then generate a new Region through Split every time the Region data reaches a certain amount. + +## S + +### Scheduler + +Scheduler is a component in PD that generates scheduling tasks. Each scheduler in PD runs independently and serves different purposes. 
Common schedulers and their purposes are: + +- `balance-leader-scheduler`: maintain Leader balance of different nodes +- `balance-region-scheduler`: maintain Peer balance of different nodes +- `hot-region-scheduler`: maintain hot Region balance of different nodes +- `evict-leader-{store-id}`: remove all Leaders of a node (often used for rolling upgrades) + +### Store + +A Store in PD refers to the storage node in the cluster, also a instance of tikv-server. Each Store has a corresponding TiKV instance, which means if multiple TiKV instances are deployed on the same host or even on the same disk, these instances still correspond to different Stores. diff --git a/dev/reference/best-practices/pd-scheduling.md b/dev/reference/best-practices/pd-scheduling.md new file mode 100644 index 0000000000000..5b3449e5b96ba --- /dev/null +++ b/dev/reference/best-practices/pd-scheduling.md @@ -0,0 +1,250 @@ +--- +title: PD Scheduling +summary: Learn best practice and strategy for PD scheduling. +category: reference +--- + +# PD Scheduling + +This document details the principles and strategies of PD scheduling through common scenarios to facilitate user application. This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: + +- [Leader/Follower/Learner](/dev/glossary.md#leaderfollowerlearner) +- [Operator](/dev/glossary.md#operator) +- [Operator Step](/dev/glossary.md#operator-step) +- [Pending/Down](/dev/glossary.md#pendingdown) +- [Region/Peer/Raft Group](/dev/glossary.md#regionpeerraft-group) +- [Region Split](/dev/glossary.md#region-split) +- [Scheduler](/dev/glossary.md#scheduler) +- [Store](/dev/glossary.md#store) + +> **Note:** +> +> This document initially targets TiDB 3.0. Though some functions are not supported in earlier versions (2.x), but this document still can be used as a reference for similar principles. 
+ +## PD scheduling policies + +This section introduces the principle and process involved in the scheduling system. + +### Scheduling process + +Scheduling process generally has three steps: + +1. Collect information + + Each TiKV node periodically reports two heartbeats to PD: `StoreHeartbeat` and `RegionHeartbeat`. `StoreHeartbeat` contains the overall information of Stores including disk capacity, remaining storage, reads and writes traffic. `RegionHeartbeat` contains the overall information of Regions including the range of each Region, peer distribution, peer status, data volume, and reads and writes traffic. PD collects and restores these information for scheduling decisions. + +2. Generate Operators + + Different schedulers generate Operators to be executed based on their own logic and requirements, with the consideration of constraints and limitations include: + + - Do not add Peers to a Store with abnormal states (disconnect, down, busy, out of space, with extensive sent/received Snapshots) + - Do not balance Regions with abnormal states + - Do not attempt to transfer a Leader to a Pending Peer + - Do not attempt to remove a Leader directly + - Do not break the physical isolation of various Region Peers + - Do not break constraints such as Label property + +3. Execute scheduling tasks + + The generated Operator first joins a queue managed by `OperatorController` rather than be executed immediately. The `OperatorController` takes the Operator out of the queue and executes it with a certain amount of concurrency based on the configuration. The procedure is to distribute each Operator Step to the corresponding Region Leader. In the end, the Operator is marked as finish or timeout and removed from the execution list. + +### Load balancing + +Region primarily relies on `balance-leader` and `balance-region` schedulers to achieve load balance. 
Both schedulers target distributing Regions evenly across all Stores in the cluster but with separate focuses: `balance-leader` attends Region Leader to balance incoming client requests. `balance-region` centers each Region Peer to redistribute the pressure of storage while avoid exceptions like storage failure. + +`balance-leader` and `balance-region` share a similar scheduling process. First, they grade different Stores according to their amount of resources. Then, `balance-leader` and `balance-region` constantly select Leaders or Peers from Stores with high scores and transfer them to Stores with low scores. Their grading methods vary though: `balance-leader` uses the sum of all Region Sizes corresponding to Leaders in a Store whereas the way of `balance-region` is relatively complicated. Since the storage capacity of different nodes might be inconsistent, the grading of `balance-region` is: + +- based on the amount of data when there is abundant storage (to make the amount of data of different nodes balanced) +- based on the remaining storage when there is insufficient storage (to make the remaining storage of different nodes balanced) +- based on the weighted sum of the two factors above when storage is neither abundant nor insufficient. + +Since different nodes might differ in performance, you can also set the weight of load balance for different Stores. `leader-weight` and `region-weight` are used to control the Leader weight and Region weight respectively (1 by default). For example, when the `leader-weight` of a Store is set to 2, the number of Leaders of the node is about twice as large as that of other nodes after the scheduling is stable. Similarly, when the `leader-weight` of a Store is set to 0.5, the number of Leaders of the node is about half as large as that of other nodes. + +### Hot Regions scheduling + +Hot Regions scheduling uses `hot-region-scheduler`. 
In v3.0, PD identifies a hot Region solely by checking whether its read/write traffic exceeds a certain threshold for a certain period, based on the information reported by Stores. PD then redistributes these Regions in a way similar to load balancing. + +For hot write Regions, `hot-region-scheduler` attempts to redistribute both Region Peers and Leaders; for hot read Regions, `hot-region-scheduler` only redistributes Region Leaders. + +### Cluster topology awareness + +Cluster topology awareness (zone/rack/host awareness) means that PD knows the cluster topology, or more specifically, how the data nodes are distributed. With this knowledge, PD schedules the Peers of each Region to be as dispersed as possible, which improves availability and disaster recovery. PD continuously scans all Regions in the background. When PD finds that the distribution of a Region is not optimal, it generates an Operator to replace Peers and redistribute the Region. + +The component that checks the distribution of Regions is `replicaChecker`, which is similar to a Scheduler except that it cannot be disabled. `replicaChecker` checks and schedules based on the configuration item `location-labels`. For example, `[zone, rack, host]` defines a three-tier topology: the cluster has multiple availability zones, each zone has multiple racks, and each rack has multiple hosts. PD attempts to schedule Region Peers to different zones first, then to different racks when there are not enough zones, and then to different hosts when there are not enough racks. + +### Scale-down and failure recovery + +Scale-down refers to taking a Store offline. You can use commands to mark the Store as `Offline`, and PD then migrates the data held by the offline node to other nodes through scheduling. Failure recovery applies when Stores fail and cannot be recovered.
In such case, a Region with Peers distributed on the corresponding Store might lose replicas, which requires PD to replenish replicas for these Regions on other nodes. + +The processes of Scale-down and failure recovery are basically the same. `replicaChecker` finds a Region Peer with abnormal states, and then generates a scheduling task to create a new replica on a healthy Store to replace the abnormal one. + +### Region merge + +Region merge refers to the process of merging adjacent small Regions by scheduling. It serves to avoid unnecessary resources consumption by a large number of small or even empty Regions after data deletion. The component used is `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all Regions in the background, and generates a scheduling task when contiguous small Regions are found. + +## Query scheduling status + +You can check the status of scheduling system through Metrics, pd-ctl and log. This section briefly introduces methods of Metrics and pd-ctl. Refer to [PD monitoring metrics](/dev/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control user guide](/dev/reference/reference/tools/pd-control.md) for details. 
+ +### Operator status + +The **Grafana PD/Operator** page shows statistics about Operators, including: + +- Schedule Operator Create: information about Operator creation, including the scheduler that created each Operator and the reason +- Operator finish duration: execution time consumed by each Operator +- Operator Step duration: execution time consumed by each Operator Step + +You can query Operators using pd-ctl with the following commands: + +- `operator show`: query all Operators generated in the current scheduling task +- `operator show [admin | leader | region]`: query Operators by type + +### Balance status + +The **Grafana PD/Statistics - Balance** page shows statistics about load balancing, including: + +- Store Leader/Region score: score of each Store +- Store Leader/Region count: the number of Leaders/Regions in each Store +- Store available: remaining storage of each Store + +You can use the `store` commands of pd-ctl to query the balance status of each Store. + +### Hotspot status + +The **Grafana PD/Statistics - hotspot** page shows statistics about hotspots, including: + +- Hot write Region’s leader/peer distribution: Leader/Peer distribution of hot write Regions +- Hot read Region’s leader distribution: Leader distribution of hot read Regions + +You can also query the status of hotspots using pd-ctl with the following commands: + +- `hot read`: query hot read Regions +- `hot write`: query hot write Regions +- `hot store`: query the distribution of hot Regions by Store +- `region topread [limit]`: query the Regions with the highest read traffic +- `region topwrite [limit]`: query the Regions with the highest write traffic + +### Region health + +The **Grafana PD/Cluster/Region health** panel shows statistics about Regions in abnormal states, including Regions with Pending Peers, Down Peers, or Offline Peers, and Regions with extra or missing Peers.
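The abnormal states listed for the Region health panel can be pictured with a small model. The following Python sketch is illustrative only and is not PD's implementation: the `Peer` and `Region` classes, their fields, the returned labels, and the `expected_replicas` parameter are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Peer:
    """Hypothetical, simplified view of one replica of a Region."""
    store_id: int
    pending: bool = False   # Raft log is far behind the Leader's
    down: bool = False      # has not responded to the Leader for a long time
    offline: bool = False   # lives on a Store that is being taken offline


@dataclass
class Region:
    region_id: int
    peers: List[Peer] = field(default_factory=list)


def health_issues(region: Region, expected_replicas: int = 3) -> List[str]:
    """Return the abnormal-state labels that apply to a Region in this toy model."""
    issues = []
    if len(region.peers) < expected_replicas:
        issues.append("miss-peer")
    if len(region.peers) > expected_replicas:
        issues.append("extra-peer")
    if any(p.down for p in region.peers):
        issues.append("down-peer")
    if any(p.pending for p in region.peers):
        issues.append("pending-peer")
    if any(p.offline for p in region.peers):
        issues.append("offline-peer")
    return issues
```

A Region can fall into several categories at once. For example, a Region with only two Peers, one of which is Pending, is reported both as missing a Peer and as having a Pending Peer.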
+ +You can query the list of Regions in abnormal conditions using pd-ctl with region check commands: + +- `region check miss-peer`: Regions without enough Peers +- `region check extra-peer`: Regions with extra Peers +- `region check down-peer`: Regions with Down Peers +- `region check pending-peer`: Regions with Pending Peers + +## Scheduling strategy control + +You can use pd-ctl to adjust the scheduling strategy from the following three aspects. Refer to [PD Control](/dev/reference/tools/pd-control.md) for more details. + +### Start-stop scheduler + +pd-ctl supports dynamically creating and deleting Schedulers. You can use the following commands to control the scheduling behavior of PD: + +- `scheduler show`: show currently working Schedulers in the system +- `scheduler remove balance-leader-scheduler`: delete (disable) balance-leader-scheduler +- `scheduler add evict-leader-scheduler-1`: add a scheduler to remove all Leaders in Store 1 + +### Add Operators manually + +Pd also supports creating or removing Schedulers directly through pd-ctl. For example: + +- `operator add add-peer 2 5`: add Peers to Region 2 in Store 5 +- `operator add transfer-leader 2 5`: migrate Region 2 Leader to Store 5 +- `operator add split-region 2`: split Region 2 into two Regions evenly in size +- `operator remove 2`: remove currently pending Operator in Region 2 + +### Adjust scheduling parameter + +You can check the scheduling configuration using pd-ctl with `config show` command, and adjust the value using `config set {key} {value}`. 
Common adjustments include: + +- `leader-schedule-limit`: control the concurrency of Transfer Leader scheduling +- `region-schedule-limit`: control the concurrency of adding/deleting Peer scheduling +- `disable-replace-offline-replica`: stop taking nodes offline +- `disable-location-replacement`: stop adjusting the isolation level of Regions +- `max-snapshot-count`: control the maximum number of Snapshots that each Store sends/receives concurrently + +## PD scheduling in common scenarios + +This section illustrates the best practices of PD scheduling strategies through several scenarios and their scheduling plans. + +### Leader/Region is not evenly distributed + +Because of the grading mechanism of PD, the Leader Count and Region Count of different Stores cannot fully reflect the load balance status. Therefore, you need to confirm whether there is load imbalance from the actual load of TiKV or the storage usage. + +Once you have confirmed that Leader/Region is not evenly distributed, you need to check the scores of different Stores. + +If the scores of different Stores are close, it means PD mistakenly believes that Leader/Region is evenly distributed. Possible reasons are: + +- There are hotspots that cause load imbalance. In such a case, you need to analyze the hot Regions scheduling before taking the next step. For more details, refer to [hotspot scheduling](#hotspots-are-not-evenly-distributed) below. +- There are a large number of empty Regions or small Regions, which leads to a great difference in the number of Leaders in different Stores and further burdens raftstore. In this case, you need to enable Region Merge and speed up the merging process. For more details, refer to the [Region Merge](#the-speed-of-region-merge-is-slow) section below. +- The hardware and software environment varies from Store to Store. You can adjust the values of `leader-weight` and `region-weight` accordingly to control the distribution of Leader/Region.
+- Other unknown reasons. You can still adjust the values of `leader-weight` and `region-weight` to control the distribution of Leader/Region. + +If there is a big difference in the scores of different Stores, you need to examine the Operator-related metrics, with special focus on the generation and execution of Operators. There are two situations in general: + +- When an Operator is generated but is processed slowly, it is possible that: + + - The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `region-schedule-limit` to a larger value as long as it does not significantly affect the application. In addition, the `max-pending-peer-count` and `max-snapshot-count` restrictions can also be relaxed as appropriate. + - Other scheduling tasks are running concurrently and competing in the system, which slows down the balancing speed. In this case, if balancing takes priority over other scheduling tasks, you can stop the other tasks or limit their speed. For example, if you take some nodes offline while Regions are rebalancing, both operations consume the quota of `region-schedule-limit`. You can limit the speed of taking nodes offline, or simply set `disable-replace-offline-replica = true` to temporarily shut it down. + - The Operator is executed too slowly. You can check the time taken by Operator Steps to confirm. Generally, steps that do not involve sending and receiving snapshots (such as TransferLeader, RemovePeer, and PromoteLearner) should be completed in milliseconds, while steps that involve snapshots (such as AddLearner and AddPeer) should be completed in tens of seconds. If the time taken is obviously too long, the cause might be excessive pressure on TiKV or a network bottleneck, which needs specific analysis. + +- PD fails to generate the corresponding balancing task. Possible reasons include: + + - The Scheduler is not enabled. For example, the corresponding Scheduler is deleted, or its limit is set to 0. + - Other constraints.
For example, `evict-leader-scheduler` in the system prevents Leaders from being migrated to the corresponding Store. Or a Label property is set, which makes some Stores reject Leaders. + - The restrictions of cluster topology. For example, in a cluster of 3 replicas and 3 data centers, the 3 replicas of each Region are distributed in different data centers due to replica isolation. If the number of Stores differs among these data centers, scheduling can only converge to a state that is balanced within each data center but not balanced globally. + +### The speed of taking nodes offline is slow + +This scenario requires examining the generation and execution of Operators through related metrics. + +When an Operator is successfully generated but is processed slowly, possible reasons are: + +- The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to a larger value. Similarly, `max-pending-peer-count` and `max-snapshot-count` can also be enlarged as appropriate. +- Other scheduling tasks are running concurrently and competing in the system. You can refer to the solution in [the previous section](#leaderregion-is-not-evenly-distributed). +- When you take a single node offline, a large share of the Regions to be processed (about 1/3 under the configuration of 3 replicas) have their Leaders on the node being taken offline, so the speed is limited by the rate at which this single node generates Snapshots. You can speed it up by manually adding an `evict-leader-scheduler` to migrate the Leaders away. + +If the corresponding Operator fails to generate, possible reasons are: + +- The Operator is stopped, or `replica-schedule-limit` is set to 0. +- there is no proper node to migrate Regions. For example, if the capacity of nodes that replace the nodes of same Label is larger than 80%, PD will stop scheduling to avoid the risk of storage failure. In such case, you need to add more nodes or delete some data to free space.
+ +### The speed of putting nodes online is slow + +Currently, putting nodes online is scheduled through the balance region mechanism, so you can refer to [Leader/Region is not evenly distributed](#leaderregion-is-not-evenly-distributed) for troubleshooting. + +### Hotspots are not evenly distributed + +Hotspot scheduling generally has the following problems: + +- There are many hot Regions, but the scheduling speed cannot keep up to redistribute them in time. + + **Solution**: adjust `hot-region-schedule-limit` to a larger value, and reduce the limit quota of other schedulers to speed up hot Regions scheduling. Or you can adjust `hot-region-cache-hits-threshold` to a smaller value to make PD more sensitive to traffic changes. + +- A single Region forms a hotspot. For example, a small table is scanned intensively in the production environment, which can also be detected from PD metrics. Since a single hotspot cannot be resolved by redistributing, you need to manually add a `split-region` Operator to split such a Region. + +- TiKV-related metrics show that the load of some nodes is significantly higher than that of other nodes, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, some Regions receive a large number of point query requests, which are not significant in terms of traffic, but whose high QPS leads to bottlenecks in key modules. + + **Solutions**: First, locate the table that forms the hotspot based on the application workload, and then add a `scatter-range-scheduler` to make all Regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details.
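To see why a smaller `hot-region-cache-hits-threshold` makes PD react faster, consider the toy model below. It assumes that a Region is treated as hot once its reported traffic has stayed above a byte threshold for a number of consecutive reports. PD's real statistics are more involved, and the class, the method names, and the numbers used here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class HotRegionCounter:
    """Toy model: a Region is hot after N consecutive reports above a traffic threshold."""

    def __init__(self, bytes_threshold: int, hits_threshold: int) -> None:
        self.bytes_threshold = bytes_threshold
        self.hits_threshold = hits_threshold
        self.hits: Dict[int, int] = defaultdict(int)

    def report(self, region_id: int, bytes_per_interval: int) -> None:
        """Record one traffic report for a Region."""
        if bytes_per_interval >= self.bytes_threshold:
            self.hits[region_id] += 1
        else:
            self.hits[region_id] = 0  # the streak is broken (an assumption of this model)

    def hot_regions(self) -> List[int]:
        """Return the IDs of Regions whose streak has reached the hits threshold."""
        return sorted(r for r, h in self.hits.items() if h >= self.hits_threshold)


def reports_until_hot(hits_threshold: int) -> int:
    """Count how many busy reports Region 7 needs before it is considered hot."""
    counter = HotRegionCounter(bytes_threshold=1000, hits_threshold=hits_threshold)
    n = 0
    while 7 not in counter.hot_regions():
        counter.report(7, 5000)
        n += 1
    return n
```

In this model, lowering the threshold from 3 to 1 marks the Region as hot after one report instead of three, at the cost of also reacting to short traffic spikes.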
+ +### The speed of Region Merge is slow + +Similar to the slow scheduling discussed earlier, the speed of Region Merge is most likely limited by default (`merge-schedule-limit` and `region-schedule-limit`), or Region Merge is competing with other schedulers. Specifically, the solutions are: + +- If it is known from statistics that there are a large number of empty Regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to a smaller value to speed up the merging. This is because merging involves replica migration, so the smaller the Region to be merged, the faster the merge. If Merge Operators are already being generated at hundreds of opm, you can further speed up the merging process by setting `patrol-region-interval` to `10ms`. This makes Region scanning faster but consumes more CPU. + +- A lot of tables have been created and then emptied (including truncated tables). These empty Regions cannot be merged if the split table attribute is enabled. You can disable this attribute by adjusting the following parameters: + + - TiKV: set `split-region-on-table` to `false` + - PD: set `namespace-classifier` to "" + +For v3.0.4 and v2.1.16 or earlier, the `approximate_keys` of Regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the number of keys break the constraints of `max-merge-region-keys`. To avoid this problem, you can adjust `max-merge-region-keys` to a larger value. + +### TiKV node troubleshooting + +If a TiKV node fails, after 30 minutes (customizable by the configuration item `max-store-down-time`), PD by default sets the corresponding node to the "Down" state and rebalances replicas for the Regions involved. + +In practice, if a node is deemed unrecoverable, you can immediately take it offline. This makes PD rebalance replicas sooner and reduces the risk of data loss.
In contrast, if a node is deemed recoverable, but might not be available in 30 minutes, you can temporarily adjust `max-store-down-time` to a larger value to avoid unnecessary replenishment of the replicas and resources waste after the timeout. \ No newline at end of file From e508c8c9e4009a5222d4b0c464df10bc2d931c0d Mon Sep 17 00:00:00 2001 From: anotherrachel Date: Fri, 15 Nov 2019 14:32:59 +0800 Subject: [PATCH 2/8] fix CI --- dev/reference/best-practices/pd-scheduling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dev/reference/best-practices/pd-scheduling.md b/dev/reference/best-practices/pd-scheduling.md index 5b3449e5b96ba..dabeb804c8005 100644 --- a/dev/reference/best-practices/pd-scheduling.md +++ b/dev/reference/best-practices/pd-scheduling.md @@ -84,7 +84,7 @@ Region merge refers to the process of merging adjacent small Regions by scheduli ## Query scheduling status -You can check the status of scheduling system through Metrics, pd-ctl and log. This section briefly introduces methods of Metrics and pd-ctl. Refer to [PD monitoring metrics](/dev/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control user guide](/dev/reference/reference/tools/pd-control.md) for details. +You can check the status of scheduling system through Metrics, pd-ctl and log. This section briefly introduces methods of Metrics and pd-ctl. Refer to [PD monitoring metrics](/dev/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control user guide](/dev/reference/tools/pd-control.md) for details. ### Operator status @@ -143,7 +143,7 @@ You can use pd-ctl to adjust the scheduling strategy from the following three as pd-ctl supports dynamically creating and deleting Schedulers. 
You can use the following commands to control the scheduling behavior of PD: -- `scheduler show`: show currently working Schedulers in the system +- `scheduler show`: show currently running Schedulers in the system - `scheduler remove balance-leader-scheduler`: delete (disable) balance-leader-scheduler - `scheduler add evict-leader-scheduler-1`: add a scheduler to remove all Leaders in Store 1 From faac14cc5e90659e44b8bd80a06db8e167213b29 Mon Sep 17 00:00:00 2001 From: anotherrachel Date: Thu, 21 Nov 2019 11:32:04 +0800 Subject: [PATCH 3/8] address comments --- dev/reference/best-practices/pd-scheduling.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/dev/reference/best-practices/pd-scheduling.md b/dev/reference/best-practices/pd-scheduling.md index dabeb804c8005..3971558214b70 100644 --- a/dev/reference/best-practices/pd-scheduling.md +++ b/dev/reference/best-practices/pd-scheduling.md @@ -31,7 +31,7 @@ Scheduling process generally has three steps: 1. Collect information - Each TiKV node periodically reports two heartbeats to PD: `StoreHeartbeat` and `RegionHeartbeat`. `StoreHeartbeat` contains the overall information of Stores including disk capacity, remaining storage, reads and writes traffic. `RegionHeartbeat` contains the overall information of Regions including the range of each Region, peer distribution, peer status, data volume, and reads and writes traffic. PD collects and restores these information for scheduling decisions. + Each TiKV node periodically reports two types of heartbeats to PD: `StoreHeartbeat` and `RegionHeartbeat`. `StoreHeartbeat` contains the overall information of Stores including disk capacity, remaining storage, read/write traffic. `RegionHeartbeat` contains the overall information of Regions including the range of each Region, peer distribution, peer status, data volume, and reads and writes traffic. PD collects and restores these information for scheduling decisions. 2. 
Generate Operators @@ -50,7 +50,7 @@ Scheduling process generally has three steps: ### Load balancing -Region primarily relies on `balance-leader` and `balance-region` schedulers to achieve load balance. Both schedulers target distributing Regions evenly across all Stores in the cluster but with separate focuses: `balance-leader` attends Region Leader to balance incoming client requests. `balance-region` centers each Region Peer to redistribute the pressure of storage while avoid exceptions like storage failure. +Region load balancing primarily relies on the `balance-leader` and `balance-region` schedulers. Both schedulers aim to distribute Regions evenly across all Stores in the cluster, but with different focuses: `balance-leader` deals with Region Leaders to balance incoming client requests, whereas `balance-region` deals with Region Peers to spread the storage pressure and avoid exceptions such as running out of storage space. `balance-leader` and `balance-region` share a similar scheduling process. First, they grade different Stores according to their amount of resources. Then, `balance-leader` and `balance-region` constantly select Leaders or Peers from Stores with high scores and transfer them to Stores with low scores. Their grading methods vary though: `balance-leader` uses the sum of all Region Sizes corresponding to Leaders in a Store whereas the way of `balance-region` is relatively complicated. Since the storage capacity of different nodes might be inconsistent, the grading of `balance-region` is: @@ -149,7 +149,7 @@ pd-ctl supports dynamically creating and deleting Schedulers. You can use the fo ### Add Operators manually -Pd also supports creating or removing Schedulers directly through pd-ctl. +PD also supports creating or removing Operators directly through pd-ctl.
For example: - `operator add add-peer 2 5`: add Peers to Region 2 in Store 5 - `operator add transfer-leader 2 5`: migrate Region 2 Leader to Store 5 @@ -210,7 +210,7 @@ When a Operator is successfully generated but processes slow, possible reasons a If the corresponding Operator fails to generate, possible reasons are: - The Operator is stopped, or `replica-schedule-limit` is set to 0. -- there is no proper node to migrate Regions. For example, if the capacity of nodes that replace the nodes of same Label is larger than 80%, PD will stop scheduling to avoid the risk of storage failure. In such case, you need to add more nodes or delete some data to free space. +- there is no proper node to migrate Regions. For example, if the capacity of nodes that replace the nodes of same Label is larger than 80%, PD will stop scheduling to avoid running out of storage space. In such case, you need to add more nodes or delete some data to free space. ### The speed of putting nodes online is slow From e92c3f160078edc9b04fb813859e48d3354a9e0e Mon Sep 17 00:00:00 2001 From: anotherrachel Date: Wed, 27 Nov 2019 15:04:43 +0800 Subject: [PATCH 4/8] address comments and refine wording --- dev/glossary.md | 34 ++-- dev/reference/best-practices/pd-scheduling.md | 185 ++++++++++-------- 2 files changed, 119 insertions(+), 100 deletions(-) diff --git a/dev/glossary.md b/dev/glossary.md index d9c170dd54c60..cc256734ecf64 100644 --- a/dev/glossary.md +++ b/dev/glossary.md @@ -10,13 +10,13 @@ category: glossary ### Leader/Follower/Learner -Leader/Follower/Learner each corresponds to three roles in a group of [Peers](#regionpeerraft-group). The Leader services all client requests and replicated data to the Followers. If the group Leader fails, one of the Followers will be elected as the new Leader. Learners are non-voting Followers that only synchronizes raft logs, and currently exists briefly in the process of replica addition. 
+Leader/Follower/Learner each corresponds to a role in a Raft group of [Peers](#regionpeerraft-group). The Leader serves all client requests and replicates data to the Followers. If the group Leader fails, one of the Followers will be elected as the new Leader. Learners are non-voting Followers that only replicate Raft logs, and currently exist only briefly in the process of replica addition. ## O ### Operator -An Operator is a collection of actions that applies to a Region and serves a scheduling purpose. For example, "migrate Region 2 Leader to Store 5", "migrate a replica of Region 2 to Store 1, 4, 5". +An Operator is a collection of actions that applies to a Region for scheduling purposes. An Operator performs tasks such as "migrate Region 2 Leader to Store 5", "migrate a replica of Region 2 to Store 1, 4, 5". An Operator can be computed and generated by a Scheduler, or created by an external API. @@ -26,28 +26,28 @@ An Operator Step is a step in the execution of an Operator. An Operator normally Currently, available Steps generated by PD include: -- `TransferLeader`: migrate a Region Leader to a specified Peer -- `AddPeer`: add Followers to a specified Store -- `RemovePeer`: delete a Region Peer -- `AddLearner`: add a Region Learner to a specified Store -- `PromoteLearner`: promote a specified Learner to a voting member -- `SplitRegion`: split a Region in two +- `TransferLeader`: migrates a Region Leader to a specified Peer +- `AddPeer`: adds Followers to a specified Store +- `RemovePeer`: removes a Region Peer +- `AddLearner`: adds a Region Learner to a specified Store +- `PromoteLearner`: promotes a specified Learner to a voting member +- `SplitRegion`: splits a Region in two ## P ### `Pending`/`Down` -`Pending` and `Down` are two special states of Peer. `Pending` indicates that the raft log of Followers or Learners is vastly different from that of Leader, and Followers in `Pending` cannot be elected as Leader. 
`Down` refers to a state that a Peer ceases to respond to the corresponding Leader for a long time, which usually means that the corresponding node is down or isolated from the network. +`Pending` and `Down` are two special states of a Peer. `Pending` indicates that the Raft log of Followers or Learners is vastly different from that of the Leader, and Followers in `Pending` cannot be elected as Leader. `Down` refers to a state in which a Peer ceases to respond to the corresponding Leader for a long time, which usually means the corresponding node is down or isolated from the network. ## R ### Region/Peer/Raft Group -Each Region maintains a continuous piece of data for the cluster (an average of about 96 MiB by default), each of which stores multiple replicas in different Stores (3 replicas by default) and each replica is referred as a Peer. Multiple Peers of the same Region synchronize data via raft protocol, so Peers also refer to members of a raft instance. TiKV uses the multi-raft pattern to manage data, that is, each Region has a corresponding, standalone raft instance (also known as a Raft Group). +A Region is the minimal unit of data storage in TiKV, and each Region represents a range of data (96 MiB by default). Each Region has multiple replicas in different Stores (3 replicas by default). Each replica is referred to as a Peer. Multiple Peers of the same Region replicate data via the Raft protocol, so Peers are also members of a Raft instance. TiKV uses multiple Raft groups (Multi-Raft) to manage data. That is, for each Region, there is a corresponding, isolated Raft Group. ### Region Split -Regions in the TiKV cluster are gradually split and generated as the written data accrues. The process of splitting is called Region Split. +Regions in the TiKV cluster are gradually split and generated as data writes continue. The process of splitting is called Region Split. 
The mechanism of Region Split is to build an initial Region to cover the entire key space in cluster initialization, and then generate a new Region through Split every time the Region data reaches a certain amount. @@ -55,13 +55,13 @@ The mechanism of Region Split is to build an initial Region to cover the entire ### Scheduler -Scheduler is a component in PD that generates scheduling tasks. Each scheduler in PD runs independently and serves different purposes. Common schedulers and their purposes are: +Schedulers are components in PD that generate scheduling tasks. Each scheduler in PD runs independently and serves different purposes. The commonly used schedulers are: -- `balance-leader-scheduler`: maintain Leader balance of different nodes -- `balance-region-scheduler`: maintain Peer balance of different nodes -- `hot-region-scheduler`: maintain hot Region balance of different nodes -- `evict-leader-{store-id}`: remove all Leaders of a node (often used for rolling upgrades) +- `balance-leader-scheduler`: maintains Leader balance of different nodes +- `balance-region-scheduler`: maintains Peer balance of different nodes +- `hot-region-scheduler`: maintains hot Region balance of different nodes +- `evict-leader-{store-id}`: evicts all Leaders of a node (often used for rolling upgrades) ### Store -A Store in PD refers to the storage node in the cluster, also a instance of tikv-server. Each Store has a corresponding TiKV instance, which means if multiple TiKV instances are deployed on the same host or even on the same disk, these instances still correspond to different Stores. +A Store in PD refers to the storage node in the cluster (an instance of `tikv-server`). Each Store has a corresponding TiKV instance. This means if multiple TiKV instances are deployed on the same host or even on the same disk, these instances still correspond to different Stores. 
diff --git a/dev/reference/best-practices/pd-scheduling.md b/dev/reference/best-practices/pd-scheduling.md index 3971558214b70..1ad6ef6870f72 100644 --- a/dev/reference/best-practices/pd-scheduling.md +++ b/dev/reference/best-practices/pd-scheduling.md @@ -6,7 +6,7 @@ category: reference # PD Scheduling -This document details the principles and strategies of PD scheduling through common scenarios to facilitate user application. This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: +This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: - [Leader/Follower/Learner](/dev/glossary.md#leaderfollowerlearner) - [Operator](/dev/glossary.md#operator) @@ -19,85 +19,104 @@ This document details the principles and strategies of PD scheduling through com > **Note:** > -> This document initially targets TiDB 3.0. Though some functions are not supported in earlier versions (2.x), but this document still can be used as a reference for similar principles. +> This document initially targets TiDB 3.0. Although some features are not supported in earlier versions (2.x), the underlying mechanisms are similar and this document can still be used as a reference. ## PD scheduling policies -This section introduces the principle and process involved in the scheduling system. +This section introduces the principles and processes involved in the scheduling system. ### Scheduling process -Scheduling process generally has three steps: +The Scheduling process generally has three steps: 1. Collect information - Each TiKV node periodically reports two types of heartbeats to PD: `StoreHeartbeat` and `RegionHeartbeat`. `StoreHeartbeat` contains the overall information of Stores including disk capacity, remaining storage, read/write traffic. 
`RegionHeartbeat` contains the overall information of Regions including the range of each Region, peer distribution, peer status, data volume, and reads and writes traffic. PD collects and restores these information for scheduling decisions. + Each TiKV node periodically reports two types of heartbeats to PD: + + - `StoreHeartbeat`: contains the overall information of Stores, including disk capacity, available storage, and read/write traffic. + - `RegionHeartbeat`: contains the overall information of Regions, including the range of each Region, peer distribution, peer status, data volume, and read/write traffic. + + PD collects and stores this information for scheduling decisions. 2. Generate Operators - Different schedulers generate Operators to be executed based on their own logic and requirements, with the consideration of constraints and limitations include: + Different schedulers generate the Operators based on their own logic and requirements, with constraints such as: - - Do not add Peers to a Store with abnormal states (disconnect, down, busy, out of space, with extensive sent/received Snapshots) - - Do not balance Regions with abnormal states - - Do not attempt to transfer a Leader to a Pending Peer - - Do not attempt to remove a Leader directly + - Do not add Peers to a Store in abnormal states (disconnected, down, busy, out of space) + - Do not balance Regions in abnormal states + - Do not transfer a Leader to a Pending Peer + - Do not remove a Leader directly - Do not break the physical isolation of various Region Peers - - Do not break constraints such as Label property + - Do not violate constraints such as Label property + +3. Execute Operators + + To execute the Operators, the general procedure is: + 1. The generated Operator first joins a queue managed by `OperatorController`. -3. Execute scheduling tasks + 2. `OperatorController` takes the Operator out of the queue and executes it with a certain amount of concurrency based on the configuration. 
This step is to distribute each Operator Step to the corresponding Region Leader. - The generated Operator first joins a queue managed by `OperatorController` rather than be executed immediately. The `OperatorController` takes the Operator out of the queue and executes it with a certain amount of concurrency based on the configuration. The procedure is to distribute each Operator Step to the corresponding Region Leader. In the end, the Operator is marked as finish or timeout and removed from the execution list. + 3. The Operator is marked as "finish" or "timeout" and removed from the queue. ### Load balancing -Region primarily relies on `balance-leader` and `balance-region` schedulers to achieve load balance. Both schedulers target distributing Regions evenly across all Stores in the cluster but with separate focuses: `balance-leader` attends Region Leader to balance incoming client requests. `balance-region` centers each Region Peer to redistribute the pressure of storage while avoid exceptions like out of storage space. +Region primarily relies on `balance-leader` and `balance-region` schedulers to achieve load balance. Both schedulers target distributing Regions evenly across all Stores in the cluster but with separate focuses: `balance-leader` deals with Region Leader to balance incoming client requests, whereas `balance-region` concerns itself with each Region Peer to redistribute the pressure of storage and avoid exceptions like out of storage space. -`balance-leader` and `balance-region` share a similar scheduling process. First, they grade different Stores according to their amount of resources. Then, `balance-leader` and `balance-region` constantly select Leaders or Peers from Stores with high scores and transfer them to Stores with low scores. Their grading methods vary though: `balance-leader` uses the sum of all Region Sizes corresponding to Leaders in a Store whereas the way of `balance-region` is relatively complicated. 
Since the storage capacity of different nodes might be inconsistent, the grading of `balance-region` is: +`balance-leader` and `balance-region` share a similar scheduling process: -- based on the amount of data when there is abundant storage (to make the amount of data of different nodes balanced) -- based on the remaining storage when there is insufficient storage (to make the remaining storage of different nodes balanced) -- based on the weighted sum of the two factors above when storage is neither abundant nor insufficient. +1. Rate Stores according to their availability of resources. +2. `balance-leader` or `balance-region` constantly transfers Leaders or Peers from Stores with high scores to those with low scores. -Since different nodes might differ in performance, you can also set the weight of load balance for different Stores. `leader-weight` and `region-weight` are used to control the Leader weight and Region weight respectively (1 by default). For example, when the `leader-weight` of a Store is set to 2, the number of Leaders of the node is about twice as large as that of other nodes after the scheduling is stable. Similarly, when the `leader-weight` of a Store is set to 0.5, the number of Leaders of the node is about half as large as that of other nodes. +However, their rating methods are different. `balance-leader` uses the sum of all Region Sizes corresponding to Leaders in a Store, whereas the method used by `balance-region` is more complicated. Depending on the specific storage capacity of each node, the rating method of `balance-region` might be: + +- based on the amount of data when there is sufficient storage (to balance data distribution among nodes). +- based on the available storage when there is insufficient storage (to balance the storage availability on different nodes). +- based on the weighted sum of the two factors above when neither of the above situations applies. 
+ +Since different nodes might differ in performance, you can also set the weight of load balancing for different Stores. `leader-weight` and `region-weight` are used to control the Leader weight and Region weight respectively ("1" by default for both). For example, when the `leader-weight` of a Store is set to "2", the number of Leaders on the node is about twice as many as that of other nodes after the scheduling stabilizes. Similarly, when the `leader-weight` of a Store is set to "0.5", the number of Leaders on the node is about half as many as that of other nodes. ### Hot Regions scheduling -Hot Regions scheduling uses `hot-region-scheduler`. Currently in v3.0, the only way to count hot Regions is to determine whether the read/write traffic exceeds a certain threshold for a certain period based on the information reported by Stores. Then, redistribute these Regions in a similar way to load balancing. +Use `hot-region-scheduler` for Hot Regions scheduling. Currently in TiDB 3.0, the process is performed as follows: + +1. Count hot Regions by determining read/write traffic that exceeds a certain threshold for a certain period based on the information reported by Stores. + +2. Redistribute these Regions in a similar way to load balancing. -For hot write Region, `hot-region-scheduler` attempts to redistribute both Region Peers and Leaders; for hot read Region, `hot-region-scheduler` only redistributes Region Leaders. +For hot write Regions, `hot-region-scheduler` attempts to redistribute both Region Peers and Leaders; for hot read Regions, `hot-region-scheduler` only redistributes Region Leaders. ### Cluster topology awareness -Cluster topology awareness (zone/rack/host awareness) is having the knowledge of cluster topology or more specifically how the different data nodes are distributed. This is to make the different Regions Peers as distributed as possible through scheduling, hence to achieve high availability and disaster recovery. 
PD continuously scans all Regions in the background. When PD finds that the distribution of Regions is not optimal, it generates a Operator to replace Peers and redistribute Regions. +Cluster topology awareness (zone/rack/host awareness) is the knowledge of how data is distributed across the cluster, which enables PD to distribute the Peers of each Region as widely as possible. This is how TiKV ensures high availability and disaster recovery. PD continuously scans all Regions in the background. When PD finds that the distribution of Regions is not optimal, it generates an Operator to replace Peers and redistribute Regions. -The component to check the distribution of Regions is `replicaChecker`, which is similar to Scheduler except that it cannot be disabled. `replicaChecker` checks and schedules based on the information provided by the configuration item `location-labels`. For example, `[zone, rack, host]` defines a three-tier topology: the cluster is configured with multiple available zones, with multiple racks under each zone and multiple hosts under each rack. PD attempts to schedule Region Peers to different zones first, or to different racks when zones are limited, or to different hosts when racks are limited. +The component that checks Region distribution is `replicaChecker`, which is similar to Scheduler except that it cannot be disabled. The `replicaChecker` schedules based on the configuration of `location-labels`. For example, `[zone, rack, host]` defines a three-tier topology for a cluster. PD attempts to schedule Region Peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on. ### Scale-down and failure recovery -Scale-down refers to taking a Store offline. You can use commands to mark the Store as `Offline` while PD replicates the data the offline node held onto other nodes by scheduling. 
Failure recovery applies when Stores failed and cannot be recovered. In such case, a Region with Peers distributed on the corresponding Store might lose replicas, which requires PD to replenish replicas for these Regions on other nodes. +Scale-down refers to the process in which you take a Store offline by marking it as "offline" using a command. PD replicates the Regions on the offline node to other nodes by scheduling. Failure recovery applies when Stores fail and cannot be recovered. In this case, Regions with Peers distributed on the corresponding Store might lose replicas, which requires PD to replenish the replicas on other nodes. -The processes of Scale-down and failure recovery are basically the same. `replicaChecker` finds a Region Peer with abnormal states, and then generates a scheduling task to create a new replica on a healthy Store to replace the abnormal one. +The processes of Scale-down and failure recovery are basically the same. `replicaChecker` finds a Region Peer in an abnormal state, and then generates an Operator to replace the abnormal Peer with a new one on a healthy Store. ### Region merge -Region merge refers to the process of merging adjacent small Regions by scheduling. It serves to avoid unnecessary resources consumption by a large number of small or even empty Regions after data deletion. The component used is `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all Regions in the background, and generates a scheduling task when contiguous small Regions are found. +Region merge refers to the process of merging adjacent small Regions by scheduling. It serves to avoid unnecessary resource consumption by a large number of small or even empty Regions after data deletion. Region merge is performed by `mergeChecker`, which works in a similar way to `replicaChecker`: PD continuously scans all Regions in the background, and generates an Operator when contiguous small Regions are found. 
## Query scheduling status -You can check the status of scheduling system through Metrics, pd-ctl and log. This section briefly introduces methods of Metrics and pd-ctl. Refer to [PD monitoring metrics](/dev/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control user guide](/dev/reference/tools/pd-control.md) for details. +You can check the status of scheduling system through Metrics, pd-ctl and logs. This section briefly introduces the methods of Metrics and pd-ctl. Refer to [PD monitoring metrics](/dev/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control](/dev/reference/tools/pd-control.md) for details. ### Operator status -**Grafana PD/Operator** page shows the statistics about Operators, among which: +The **Grafana PD/Operator** page shows the statistics about Operators, among which: -- Schedule Operator Create: information include the reason and target scheduler created by a Operator -- Operator finish duration: execution time consumed by each Operator -- Operator Step duration: execution time consumed by each Operator Step +- Schedule Operator Create: Operator creating information, such as the creating reason and the target scheduler +- Operator finish duration: execution time consumed by the Operator +- Operator Step duration: execution time consumed by the Operator Step You can query Operators using pd-ctl with the following commands: -- `operator show`: query all Operators generated iny the current scheduling task -- `operator show [admin | leader | region]`: query Operators by type +- `operator show`: queries all Operators generated in the current scheduling task +- `operator show [admin | leader | region]`: queries Operators by type ### Balance status @@ -105,35 +124,35 @@ You can query Operators using pd-ctl with the following commands: - Store Leader/Region score: score of each Store - Store Leader/Region count: the number of Leaders/Regions in each Store -- Store available: remaining storage of each Store +- Store available: 
available storage on each Store You can use store commands of pd-ctl to query balance status of each Store. -### Hotspot status +### Hot Region status -**Grafana PD/Statistics - hotspot** page shows the statistics about hotspots, among which: +The **Grafana PD/Statistics - hotspot** page shows the statistics about hot Regions, among which: - Hot write Region’s leader/peer distribution: Leader/Peer distribution in hot write Regions - Hot read Region’s leader distribution: Leader distribution in hot read Regions You can also query the status of hotspots using pd-ctl with the following commands: -- `hot read`: query hot read Regions -- `hot write`: query hot write Regions -- `hot store`: query the distribution of hot Regions by Store -- `region topread [limit]`: query the Region with top read traffic -- `region topwrite [limit]`: query the Region with top write traffic +- `hot read`: queries hot read Regions +- `hot write`: queries hot write Regions +- `hot store`: queries the distribution of hot Regions by Store +- `region topread [limit]`: queries the Region with top read traffic +- `region topwrite [limit]`: queries the Region with top write traffic ### Region health -**Grafana PD/Cluster/Region health** panel shows the statistics about Regions in abnormal states, include Pending Peer, Down Peer, Offline Peer and Regions with extra or few Peers. +The **Grafana PD/Cluster/Region health** panel shows the statistics about Regions in abnormal states, including Pending Peers, Down Peers, Offline Peers, and Regions with extra or too few Peers. 
You can query the list of Regions in abnormal conditions using pd-ctl with region check commands: -- `region check miss-peer`: Regions without enough Peers -- `region check extra-peer`: Regions with extra Peers -- `region check down-peer`: Regions with Down Peers -- `region check pending-peer`: Regions with Pending Peers +- `region check miss-peer`: queries Regions without enough Peers +- `region check extra-peer`: queries Regions with extra Peers +- `region check down-peer`: queries Regions with Down Peers +- `region check pending-peer`: queries Regions with Pending Peers ## Scheduling strategy control @@ -143,82 +162,82 @@ You can use pd-ctl to adjust the scheduling strategy from the following three as pd-ctl supports dynamically creating and deleting Schedulers. You can use the following commands to control the scheduling behavior of PD: -- `scheduler show`: show currently running Schedulers in the system -- `scheduler remove balance-leader-scheduler`: delete (disable) balance-leader-scheduler -- `scheduler add evict-leader-scheduler-1`: add a scheduler to remove all Leaders in Store 1 +- `scheduler show`: shows currently running Schedulers in the system +- `scheduler remove balance-leader-scheduler`: removes (disables) balance-leader-scheduler +- `scheduler add evict-leader-scheduler-1`: adds a scheduler to remove all Leaders in Store 1 ### Add Operators manually Pd also supports creating or removing Operators directly through pd-ctl. 
For example: -- `operator add add-peer 2 5`: add Peers to Region 2 in Store 5 -- `operator add transfer-leader 2 5`: migrate Region 2 Leader to Store 5 -- `operator add split-region 2`: split Region 2 into two Regions evenly in size -- `operator remove 2`: remove currently pending Operator in Region 2 +- `operator add add-peer 2 5`: adds Peers to Region 2 in Store 5 +- `operator add transfer-leader 2 5`: migrates Region 2 Leader to Store 5 +- `operator add split-region 2`: splits Region 2 into two Regions evenly in size +- `operator remove 2`: removes currently pending Operator in Region 2 ### Adjust scheduling parameter -You can check the scheduling configuration using pd-ctl with `config show` command, and adjust the value using `config set {key} {value}`. Common adjustments include: +You can check the scheduling configuration using the `config show` command in pd-ctl, and adjust the values using `config set {key} {value}`. Common adjustments include: -- `leader-schedule-limit`: control the number of concurrency of Transfer Leader scheduling -- `region-schedule-limit`: control the number of concurrency of adding/rdeleting Peer scheduling -- `disable-replace-offline-replia`: stop taking nodes offline -- `disable-location-replacement`: stop adjusting the isolation level of Regions -- `max-snapshot-count`: control the maximum of sent/received Snapshots concurrently of each Store +- `leader-schedule-limit`: controls the concurrency of Transfer Leader scheduling +- `region-schedule-limit`: controls the concurrency of adding/deleting Peer scheduling +- `disable-replace-offline-replica`: determines whether to disable the scheduling to take nodes offline +- `disable-location-replacement`: determines whether to disable the scheduling that handles the isolation level of Regions +- `max-snapshot-count`: controls the maximum concurrency of sent/received Snapshots for each Store ## PD scheduling in common scenarios -This section illustrates the best practice of PD scheduling 
strategies through several scenarios and their scheduling plans. +This section illustrates the best practices of PD scheduling strategies through several typical scenarios. -### Leader/Region is not evenly distributed +### Leader/Region are not evenly balanced -The grading mechanism of PD determines that Leader Count and Region Count of different Stores cannot fully explain the load balance status. Therefore, it is necessary to confirm whether there is load imbalance from the actual load of TiKV or Storage usage. +Because of the rating mechanism of PD, the Leader Count and Region Count of different Stores cannot fully reflect the load balancing status. Therefore, you need to confirm whether there is a load imbalance based on the actual load or storage usage of TiKV. -Once you have confirmed that Leader/Region is not evenly distributed, you need to check the grading of different Stores. +Once you have confirmed that Leader/Region is not evenly distributed, you need to check the rating of different Stores. If the scores of different Stores are close, it means PD mistakenly believes that Leader/Region is evenly distributed. Possible reasons are: -- There are hotspots which cause load imbalance. In such case, you need to collect information about hot Regions scheduling before taking the next step. For more details, refer to [hotspot scheduling](#hot-regions-is-not-evenly-distributed) below. -- There are a large number of empty Regions or small Regions, which leads to a great difference in the number of Leaders in different Stores and further burdens raftstore. This is the time for a Region Merge and quicken merging process. For more details, refer to the [Region Merge](#the-speed-of-region-merge-is-slow) section below. -- Hardware and software environment varies from Store to Store. You can accordingly adjust the value of `leader-weight` and `region-weight` to control the distribution of Leader/Region. -- Other unknown reasons. 
Still you can adjust the value of `leader-weight` and `region-weight` to control the distribution of Leader/Region. +- There are hot Regions that cause load imbalance. In this case, you need to analyze further based on [hot Regions scheduling](#hot-regions-are-not-evenly-distributed). +- There are a large number of empty Regions or small Regions, which leads to a great difference in the number of Leaders in different Stores and high pressure on raftstore. In this case, you can use [Region Merge](#the-speed-of-region-merge-is-slow) scheduling. +- The hardware and software environment varies among Stores. You can adjust the values of `leader-weight` and `region-weight` accordingly to control the distribution of Leader/Region. +- Other unknown reasons. You can still adjust the values of `leader-weight` and `region-weight` to control the distribution of Leader/Region. -If there is a big difference in the grading of different Stores, you need to examine the Operator-related metrics, with special focus on the generation and execution of Operators. There are two situations in general: +If there is a big difference in the rating of different Stores, you need to examine the Operator-related metrics, with special focus on the generation and execution of Operators. There are two main situations: - When a Operator is generated but processes slow, it is possible that: - - the scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `region-schedule-limit` to a larger value without significantly impacting application. In addition, the `max-pending-peer-count` and `max-snapshot-count` restrictions can also be properly adjusted. - - other scheduling tasks are running concurrently and competing in the system, which slows down the balancing speed. In this case, if the balancing priors to other scheduling tasks, you can stop other tasks or limit their speed. 
For example, if you take some nodes offline when Regions are rebalancing, both operations consume the quota of `region-schedule-limit`. You can limit the speed of taking nodes offline, or simply set `disable-replace-offline-replica = true` to temporarily shut it down. + - The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `region-schedule-limit` to a larger value without significantly impacting the application. In addition, the `max-pending-peer-count` and `max-snapshot-count` restrictions can also be properly adjusted. + - Other scheduling tasks are running concurrently and competing in the system, which slows down the balancing speed. In this case, if the balancing takes priority over other scheduling tasks, you can stop other tasks or limit their speed. For example, if you take some nodes offline when Regions are rebalancing, both operations consume the quota of `region-schedule-limit`. You can limit the speed of taking nodes offline, or simply set `disable-replace-offline-replica = true` to temporarily disable it. - The Operator processes too slow. You can check the time taken by Operator Steps to confirm. Generally, steps that do not involve sending and receiving snapshots (such as TransferLeader, RemovePeer, PromoteLearner, etc.) should be completed in milliseconds, while steps that involve snapshots (such as AddLearner, AddPeer, etc.) should be completed in tens of seconds. If the time taken is obviously too high, it is possible due to the excessive pressure of TiKV or the bottleneck of network, etc., which needs specific analysis. - PD fails to generate the corresponding balancing task. Possible reasons include: - - Scheduler is not enabled. For example, the corresponding Scheduler is deleted, or limit being set to 0. - - other constraints. For example, `evict-leader-scheduler` in the system prevents Leaders from being migrating to the corresponding Store. Or Label property is set, which makes some Stores reject Leaders. + - The Scheduler is not activated. For example, the corresponding Scheduler is deleted, or its limit is set to 0. + - Other constraints. For example, `evict-leader-scheduler` in the system prevents Leaders from being migrated to the corresponding Store, or a Label property is set that makes some Stores reject Leaders. 
- - the restrictions of cluster topology. For example, in a cluster of 3 replicas and 3 data centers, 3 replicas of each Region are distributed in different data centers due to replica isolation. If the number of Stores of these data centers are different, the final scheduling reaches a balanced but globally unbalanced state in each data center. + - The Scheduler is not activated. For example, the corresponding Scheduler is deleted, or limit being set to 0. + - Other constraints. For example, `evict-leader-scheduler` in the system prevents Leaders from being migrating to the corresponding Store. Or Label property is set, which makes some Stores reject Leaders. + - The restrictions of cluster topology. For example, in a cluster of 3 replicas and 3 data centers, 3 replicas of each Region are distributed in different data centers due to replica isolation. If the number of Stores of these data centers are different, the final scheduling reaches a balanced but globally unbalanced state in each data center. ### The speed of taking nodes offline is slow This scenario requires examining the generation and execution of Operators through related metrics. -When a Operator is successfully generated but processes slow, possible reasons are: +When an Operator is successfully generated but processes slow, possible reasons are: -- the schedule speed is limit by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to a larger value. Similarly, `max-pending-peer-count` and `max-snapshot-count` can also be properly enlarged. -- other scheduling tasks are running concurrently and competing in the system. You can refer to the solution in [the previous section](#leaderregion-is-not-evenly-distributed). -- when you take a single node offline, since a number of Region Leaders to be operated are concentrated on the offline node (about 1/3 under the configuration of 3 replicas), the speed is limited by the speed at which this single node generates Snapshots. 
You can speed it up by manually adding an `evict-leader-scheduler` to migrate Leaders. +- The schedule speed is limited by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to a larger value. Similarly, `max-pending-peer-count` and `max-snapshot-count` can also be properly enlarged. +- Other scheduling tasks are running concurrently and competing in the system. You can refer to the solution in [the previous section](#leaderregion-is-not-evenly-distributed). +- When you take a single node offline, a number of Region Leaders to be operated are concentrated on the offline node (about 1/3 under the configuration of 3 replicas), so the speed is limited by the speed at which this single node generates Snapshots. You can speed it up by manually adding an `evict-leader-scheduler` to migrate Leaders. If the corresponding Operator fails to generate, possible reasons are: - The Operator is stopped, or `replica-schedule-limit` is set to 0. -- there is no proper node to migrate Regions. For example, if the capacity of nodes that replace the nodes of same Label is larger than 80%, PD will stop scheduling to avoid running out of storage space. In such case, you need to add more nodes or delete some data to free space. +- There is no proper node to migrate Regions. For example, if the capacity of nodes that replace the nodes of same Label is larger than 80%, PD will stop scheduling to avoid running out of storage space. In such case, you need to add more nodes or delete some data to free space. ### The speed of putting nodes online is slow Currently, to take nodes online is scheduled through balance region mechanism, so you can refer to [Leader/Region is not evenly distributed](#leaderregion-is-not-evenly-distributed) for troubleshooting. 
-### Hotspots are not evenly distributed +### Hot Regions are not evenly distributed -Hotspots scheduling generally has the following problems: +Hot Regions scheduling generally has the following problems: - There is a majority of hot Regions, but the scheduling speed cannot keep up with them to redistribute hot Regions in time. @@ -226,7 +245,7 @@ Hotspots scheduling generally has the following problems: - A single Region with extensive traffics. For example, to scan a small table extensively is required in the production environment, which can also be detected from PD metrics. Since a single hotspot cannot be resolved by redistributing, you need to manually add a `split-region` Operator to redistribute such a Region. -- the load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis. So it is possible that PD fails to identify hotspots in certain scenarios. For example, some Regions have a large number of point-and-check requests, which are not significant in terms of traffic, but high QPS of which leads to bottlenecks in key modules. +- The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis. So it is possible that PD fails to identify hotspots in certain scenarios. For example, some Regions have a large number of point-and-check requests, which are not significant in terms of traffic, but high QPS of which leads to bottlenecks in key modules. **Solutions**: Firstly, locate the table with extensive traffic by examining operational needs, and add a `scatter-range-scheduler` to make all Regions of this table are evenly distributed. TiDB also provides an interface in its HTTP AIP to simplify this operation. 
Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. @@ -234,9 +253,9 @@ Hotspots scheduling generally has the following problems: Similar to the slow scheduling discussed earlier, the speed of Region Merge is most likely limited by default (`merge-schedule-limit` and `region-schedule-limit`), or Region Merge is competing with other schedulers. Specifically, the solutions are: -- if it is known from statistics that there are a large number of empty Regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to a smaller value to speed up the merging. This is because merging involves replica migration, so the smaller the Region to be merged, the faster. If the generated Merge Operator is already has hundreds of opm, to further speed up the merging process, you can set `patrol-Region-interval` to `10ms`. This will make Region scanning faster but consume more CPU. +- If it is known from statistics that there are a large number of empty Regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to a smaller value to speed up the merging. This is because merging involves replica migration, so the smaller the Region to be merged, the faster. If the generated Merge Operator is already has hundreds of opm, to further speed up the merging process, you can set `patrol-Region-interval` to `10ms`. This will make Region scanning faster but consume more CPU. -- a lot of tables have been created and then emptied (including truncated tables). These empty Regions cannot be merged if the split table attribute is enabled. You can disable this attribute by adjusting the following parameters: +- A lot of tables have been created and then emptied (including truncated tables). These empty Regions cannot be merged if the split table attribute is enabled. 
You can disable this attribute by adjusting the following parameters: - TiKV: set `split-region-on-table` to `false` - PD: set `namespace-classifier` to "" From e92c345c761b3cb9457a257b3ed52150c73da1ac Mon Sep 17 00:00:00 2001 From: anotherrachel Date: Thu, 28 Nov 2019 20:50:30 +0800 Subject: [PATCH 5/8] address comments --- dev/glossary.md | 34 ++-- dev/reference/best-practices/pd-scheduling.md | 165 +++++++++--------- 2 files changed, 100 insertions(+), 99 deletions(-) diff --git a/dev/glossary.md b/dev/glossary.md index cc256734ecf64..65f03bdf950f6 100644 --- a/dev/glossary.md +++ b/dev/glossary.md @@ -10,13 +10,13 @@ category: glossary ### Leader/Follower/Learner -Leader/Follower/Learner each corresponds to a role in a Raft group of [Peers](#regionpeerraft-group). The Leader services all client requests and replicated data to the Followers. If the group Leader fails, one of the Followers will be elected as the new Leader. Learners are non-voting Followers that only replicates raft logs, and currently exists briefly in the process of replica addition. +Leader/Follower/Learner each corresponds to a role in a Raft group of [Peers](#regionpeerraft-group). The Leader services all client requests and replicates data to the Followers. If the group Leader fails, one of the Followers will be elected as the new Leader. Learners are non-voting Followers that only serve in the process of replica addition. ## O ### Operator -An Operator is a collection of actions that applies to a Region for scheduling purposes. An Operator perform tasks such as "migrate Region 2 Leader to Store 5", "migrate a replica of Region 2 to Store 1, 4, 5". +An Operator is a collection of actions that applies to a Region for scheduling purposes. Operators perform scheduling tasks such as "migrate the Leader of Region 2 to Store 5" and "migrate replicas of Region 2 to Store 1, 4, 5". An Operator can be computed and generated by a Scheduler, or created by an external API. 
@@ -26,30 +26,30 @@ An Operator Step is a step in the execution of an Operator. An Operator normally Currently, available Steps generated by PD include: -- `TransferLeader`: migrates a Region Leader to a specified Peer -- `AddPeer`: adds Followers to a specified Store -- `RemovePeer`: removes a Region Peer -- `AddLearner`: adds a Region Learner to a specified Store -- `PromoteLearner`: promotes a specified Learner to a voting member -- `SplitRegion`: splits a Region in two +- `TransferLeader`: Transfers Leadership to a specified member +- `AddPeer`: Adds Peers to a specified Store +- `RemovePeer`: Removes a Peer of a Region +- `AddLearner`: Adds Learners to a specified Store +- `PromoteLearner`: Promotes a specified Learner to a voting member +- `SplitRegion`: Splits a specified Region into two ## P ### `Pending`/`Down` -`Pending` and `Down` are two special states of a Peer. `Pending` indicates that the Raft log of Followers or Learners is vastly different from that of Leader, and Followers in `Pending` cannot be elected as Leader. `Down` refers to a state that a Peer ceases to respond to the corresponding Leader for a long time, which usually means the corresponding node is down or isolated from the network. +`Pending` and `Down` are two special states of a Peer. `Pending` indicates that the Raft log of Followers or Learners is vastly different from that of Leader. Followers in `Pending` cannot be elected as Leader. `Down` refers to a state that a Peer ceases to respond to Leader for a long time, which usually means the corresponding node is down or isolated from the network. ## R ### Region/Peer/Raft Group -Region is the minimal piece of data storage in TiKV, each representing a range of data (96 MiB by default). Each Region has multiple replicas in different Stores (3 replicas by default). Each replica is referred to as a Peer. Multiple Peers of the same Region replicate data via the Raft protocol, so Peers are also members of a raft instance. 
TiKV uses multiple Raft group (Multi-Raft) to manage data. That is, for each Region, there is a corresponding, isolated Raft Group. +Region is the minimal piece of data storage in TiKV, each representing a range of data (96 MiB by default). Each Region has three replicas by default. A replica of a Region is called a Peer. Multiple Peers of the same Region replicate data via the Raft protocol, so Peers are also members of a Raft instance. TiKV uses Multi-Raft to manage data. That is, for each Region, there is a corresponding, isolated Raft Group. ### Region Split -Regions in the TiKV cluster are gradually split and generated as data write continues. The process of splitting is called Region Split. +Regions are generated as data writes increase. The process of splitting is called Region Split. -The mechanism of Region Split is to build an initial Region to cover the entire key space in cluster initialization, and then generate a new Region through Split every time the Region data reaches a certain amount. +The mechanism of Region Split is to use one initial Region to cover the entire key space, and generate new Regions through splitting existing ones every time the size of the Region or the number of keys has reached a threshold. ## S @@ -57,11 +57,11 @@ The mechanism of Region Split is to build an initial Region to cover the entire Schedulers are components in PD that generate scheduling tasks. Each scheduler in PD runs independently and serves different purposes. 
The commonly used schedulers are: -- `balance-leader-scheduler`: maintains Leader balance of different nodes -- `balance-region-scheduler`: maintains Peer balance of different nodes -- `hot-region-scheduler`: maintains hot Region balance of different nodes -- `evict-leader-{store-id}`: evicts all Leaders of a node (often used for rolling upgrades) +- `balance-leader-scheduler`: Balances the distribution of Leaders +- `balance-region-scheduler`: Balances the distribution of Peers +- `hot-region-scheduler`: Balances the distribution of hot Regions +- `evict-leader-{store-id}`: Evicts all Leaders of a node (often used for rolling upgrades) ### Store -A Store in PD refers to the storage node in the cluster (an instance of `tikv-server`). Each Store has a corresponding TiKV instance. This means if multiple TiKV instances are deployed on the same host or even on the same disk, these instances still correspond to different Stores. +A Store refers to the storage node in the TiKV cluster (an instance of `tikv-server`). Each Store has a corresponding TiKV instance. \ No newline at end of file diff --git a/dev/reference/best-practices/pd-scheduling.md b/dev/reference/best-practices/pd-scheduling.md index 1ad6ef6870f72..929db88530151 100644 --- a/dev/reference/best-practices/pd-scheduling.md +++ b/dev/reference/best-practices/pd-scheduling.md @@ -33,8 +33,8 @@ The Scheduling process generally has three steps: Each TiKV node periodically reports two types of heartbeats to PD: - - `StoreHeartbeat`: contains the overall information of Stores, including disk capacity, available storage, and read/write traffic. - - `RegionHeartbeat`: contains the overall information of Regions, including the range of each Region, peer distribution, peer status, data volume, and read/write traffic. + - `StoreHeartbeat`: Contains the overall information of Stores, including disk capacity, available storage, and read/write traffic. 
+ - `RegionHeartbeat`: Contains the overall information of Regions, including the range of each Region, peer distribution, peer status, data volume, and read/write traffic. PD collects and restores this information for scheduling decisions. @@ -52,9 +52,10 @@ The Scheduling process generally has three steps: 3. Execute Operators To execute the Operators, the general procedure is: + 1. The generated Operator first joins a queue managed by `OperatorController`. - 2. `OperatorController` takes the Operator out of the queue and executes it with a certain amount of concurrency based on the configuration. This step is to distribute each Operator Step to the corresponding Region Leader. + 2. `OperatorController` takes the Operator out of the queue and executes it with a certain amount of concurrency based on the configuration. This step is to assign each Operator Step to the corresponding Region Leader. 3. The Operator is marked as "finish" or "timeout" and removed from the queue. @@ -64,20 +65,20 @@ Region primarily relies on `balance-leader` and `balance-region` schedulers to a `balance-leader` and `balance-region` share a similar scheduling process: -1. Rate Stores according to their availability of resources. +1. Rate Stores according to their resource availability. 2. `balance-leader` or `balance-region` constantly transfer Leaders or Peers from Stores with high scores to those with low scores. -However, their rating methods are different. `balance-leader` uses the sum of all Region Sizes corresponding to Leaders in a Store, whereas the way of `balance-region` is relatively complicated. Depending on the specific storage capacity of each node, the rating method of `balance-region` might be: +However, their rating methods are different. `balance-leader` uses the sum of all Region Sizes corresponding to Leaders in a Store, whereas the way of `balance-region` is relatively complicated. 
Depending on the specific storage capacity of each node, the rating method of `balance-region` might: - based on the amount of data when there is sufficient storage (to balance data distribution among nodes). - based on the available storage when there is insufficient storage (to balance the storage availability on different nodes). -- based on the weighted sum of the two factors above when neither of the above situations applies. +- based on the weighted sum of the two factors above when neither of the situations applies. -Since different nodes might differ in performance, you can also set the weight of load balancing for different Stores. `leader-weight` and `region-weight` are used to control the Leader weight and Region weight respectively ("1" by default for both). For example, when the `leader-weight` of a Store is set to "2", the number of Leaders on the node is about twice as many as that of other nodes after the scheduling stabilizes. Similarly, when the `leader-weight` of a Store is set to "0.5", the number of Leaders on the node is about half as many as that of other nodes. +Because different nodes might differ in performance, you can also set the weight of load balancing for different Stores. `leader-weight` and `region-weight` are used to control the Leader weight and Region weight respectively ("1" by default for both). For example, when the `leader-weight` of a Store is set to "2", the number of Leaders on the node is about twice as many as that of other nodes after the scheduling stabilizes. Similarly, when the `leader-weight` of a Store is set to "0.5", the number of Leaders on the node is about half as many as that of other nodes. ### Hot Regions scheduling -Use `hot-region-scheduler` for Hot Regions scheduling. Currently in TiDB 3.0, the process is performed as follows: +For hot Regions scheduling, use `hot-region-scheduler`. Currently in TiDB 3.0, the process is performed as follows: 1. 
Count hot Regions by determining read/write traffic that exceeds a certain threshold for a certain period based on the information reported by Stores. @@ -87,19 +88,19 @@ For hot write Regions, `hot-region-scheduler` attempts to redistribute both Regi ### Cluster topology awareness -Cluster topology awareness (zone/rack/host awareness) is having the knowledge of how data is distributed, which enables PD to distribute Region Peers as much as possible. This is how TiKV ensures high availability and disaster recovery. Because PD continuously scans all Regions in the background, when PD finds that the distribution of Regions is not optimal, it generates an Operator to replace Peers and redistribute Regions. +Cluster topology awareness enables PD to distribute replicas of a Region as much as possible. This is how TiKV ensures high availability and disaster recovery capability. PD continuously scans all Regions in the background. When PD finds that the distribution of Regions is not optimal, it generates an Operator to replace Peers and redistribute Regions. -The component to check Region distribution is `replicaChecker`, which is similar to Scheduler except that it cannot be disabled. The `replicaChecker` schedules based on the the configuration of `location-labels`. For example, `[zone, rack, host]` defines a three-tier topology for a cluster. PD attempts to schedule Region Peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on. +The component to check Region distribution is `replicaChecker`, which is similar to Scheduler except that it cannot be disabled. `replicaChecker` schedules based on the configuration of `location-labels`. For example, `[zone, rack, host]` defines a three-tier topology for a cluster. 
PD attempts to schedule Region Peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on. ### Scale-down and failure recovery Scale-down refers to the process when you take a Store offline and mark it as "offline" using a command. PD replicates the Regions on the offline node to other nodes by scheduling. Failure recovery applies when Stores failed and cannot be recovered. In this case, Regions with Peers distributed on the corresponding Store might lose replicas, which requires PD to replenish on other nodes. -The processes of Scale-down and failure recovery are basically the same. `replicaChecker` finds a Region Peer in abnormal states, and then generates an Operator to replace the abnormal Peer with a new one on a healthy Store. +The processes of Scale-down and failure recovery are basically the same. `replicaChecker` finds a Region Peer in abnormal states, and then generates an Operator to replace the abnormal Peer with a new one on a healthy Store. -### Region merge +### Region Merge -Region merge refers to the process of merging adjacent small Regions by scheduling. It serves to avoid unnecessary resource consumption by a large number of small or even empty Regions after data deletion. Region merge is performed by `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all Regions in the background, and generates an Operator when continuous small Regions are found. +Region Merge refers to the process of merging adjacent small Regions. It serves to avoid unnecessary resource consumption by a large number of small or even empty Regions after data deletion. Region Merge is performed by `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all Regions in the background, and generates an Operator when contiguous small Regions are found. 
## Query scheduling status @@ -109,22 +110,22 @@ You can check the status of scheduling system through Metrics, pd-ctl and logs. The **Grafana PD/Operator** page shows the statistics about Operators, among which: -- Schedule Operator Create: Operator creating information, such as the creating reason and the target scheduler -- Operator finish duration: execution time consumed by the Operator -- Operator Step duration: execution time consumed by the Operator Step +- Schedule Operator Create: Operator creating information +- Operator finish duration: Execution time consumed by each Operator +- Operator Step duration: Execution time consumed by the Operator Step You can query Operators using pd-ctl with the following commands: -- `operator show`: queries all Operators generated in the current scheduling task -- `operator show [admin | leader | region]`: queries Operators by type +- `operator show`: Queries all Operators generated in the current scheduling task +- `operator show [admin | leader | region]`: Queries Operators by type ### Balance status -**Grafana PD/Statistics - Balance** page shows the statistics about load balancing, among which: +The **Grafana PD/Statistics - Balance** page shows the statistics about load balancing, among which: -- Store Leader/Region score: score of each Store -- Store Leader/Region count: the number of Leaders/Regions in each Store -- Store available: available storage on each Store +- Store Leader/Region score: Score of each Store +- Store Leader/Region count: The number of Leaders/Regions in each Store +- Store available: Available storage on each Store You can use store commands of pd-ctl to query balance status of each Store. 
@@ -135,125 +136,125 @@ The **Grafana PD/Statistics - hotspot** page shows the statistics about hot Regi - Hot write Region’s leader/peer distribution: Leader/Peer distribution in hot write Regions - Hot read Region’s leader distribution: Leader distribution in hot read Regions -You can also query the status of hotspots using pd-ctl with the following commands: +You can also query the status of hot Regions using pd-ctl with the following commands: -- `hot read`: queries hot read Regions -- `hot write`: queries hot write Regions -- `hot store`: queries the distribution of hot Regions by Store -- `region topread [limit]`: queries the Region with top read traffic -- `region topwrite [limit]`: queries the Region with top write traffic +- `hot read`: Queries hot read Regions +- `hot write`: Queries hot write Regions +- `hot store`: Queries the distribution of hot Regions by Store +- `region topread [limit]`: Queries the Region with top read traffic +- `region topwrite [limit]`: Queries the Region with top write traffic ### Region health The **Grafana PD/Cluster/Region health** panel shows the statistics about Regions in abnormal states, include Pending Peer, Down Peer, Offline Peer and Regions with extra or few Peers. 
-You can query the list of Regions in abnormal conditions using pd-ctl with region check commands: +You can query the list of Regions in abnormal states using pd-ctl with region check commands: -- `region check miss-peer`: queries Regions without enough Peers -- `region check extra-peer`: queries Regions with extra Peers -- `region check down-peer`: queries Regions with Down Peers -- `region check pending-peer`: queries Regions with Pending Peers +- `region check miss-peer`: Queries Regions without enough Peers +- `region check extra-peer`: Queries Regions with extra Peers +- `region check down-peer`: Queries Regions with Down Peers +- `region check pending-peer`: Queries Regions with Pending Peers -## Scheduling strategy control +## Control scheduling strategy You can use pd-ctl to adjust the scheduling strategy from the following three aspects. Refer to [PD Control](/dev/reference/tools/pd-control.md) for more details. -### Start-stop scheduler +### Add/delete Scheduler manually -pd-ctl supports dynamically creating and deleting Schedulers. You can use the following commands to control the scheduling behavior of PD: +PD supports dynamically adding and removing Schedulers directly through pd-ctl. For example: -- `scheduler show`: shows currently running Schedulers in the system -- `scheduler remove balance-leader-scheduler`: removes (disable) balance-leader-scheduler -- `scheduler add evict-leader-scheduler-1`: adds a scheduler to remove all Leaders in Store 1 +- `scheduler show`: Shows currently running Schedulers in the system +- `scheduler remove balance-leader-scheduler`: Removes (disables) balance-leader-scheduler +- `scheduler add evict-leader-scheduler-1`: Adds a scheduler to remove all Leaders in Store 1 -### Add Operators manually +### Add/delete Operators manually -Pd also supports creating or removing Operators directly through pd-ctl. +PD also supports adding or removing Operators directly through pd-ctl. 
For example: -- `operator add add-peer 2 5`: adds Peers to Region 2 in Store 5 -- `operator add transfer-leader 2 5`: migrates Region 2 Leader to Store 5 -- `operator add split-region 2`: splits Region 2 into two Regions evenly in size -- `operator remove 2`: removes currently pending Operator in Region 2 +- `operator add add-peer 2 5`: Adds Peers to Region 2 in Store 5 +- `operator add transfer-leader 2 5`: Migrates the Leader of Region 2 to Store 5 +- `operator add split-region 2`: Splits Region 2 into two Regions evenly in size +- `operator remove 2`: Removes currently pending Operator in Region 2 ### Adjust scheduling parameter You can check the scheduling configuration using the `config show` command in pd-ctl, and adjust the values using `config set {key} {value}`. Common adjustments include: -- `leader-schedule-limit`: controls the concurrency of Transfer Leader scheduling -- `region-schedule-limit`: controls the concurrency of adding/deleting Peer scheduling -- `disable-replace-offline-replica`: determines whether to disable the scheduling to take nodes offline -- `disable-location-replacement`: determines whether to disable the scheduling that handles the isolation level of Regions -- `max-snapshot-count`: controls the maximum concurrency of sent/received Snapshots for each Store +- `leader-schedule-limit`: Controls the concurrency of Transfer Leader scheduling +- `region-schedule-limit`: Controls the concurrency of adding/deleting Peer scheduling +- `disable-replace-offline-replica`: Determines whether to disable the scheduling to take nodes offline +- `disable-location-replacement`: Determines whether to disable the scheduling that handles the isolation level of Regions +- `max-snapshot-count`: Controls the maximum concurrency of sending/receiving Snapshots for each Store ## PD scheduling in common scenarios This section illustrates the best practices of PD scheduling strategies through several typical scenarios. 
-### Leader/Region are not evenly balanced +### Leaders/Regions are not evenly distributed The rating mechanism of PD determines that Leader Count and Region Count of different Stores cannot fully reflect the load balancing status. Therefore, it is necessary to confirm whether there is load imbalancing from the actual load of TiKV or Storage usage. -Once you have confirmed that Leader/Region is not evenly distributed, you need to check the rating of different Stores. +Once you have confirmed that Leaders/Regions are not evenly distributed, you need to check the rating of different Stores. -If the scores of different Stores are close, it means PD mistakenly believes that Leader/Region is evenly distributed. Possible reasons are: +If the scores of different Stores are close, it means PD mistakenly believes that Leaders/Regions are evenly distributed. Possible reasons are: - There are hot Regions that cause load imbalancing. In this case, you need to analyze further based on [hot Regions scheduling](#hot-regions-are-not-evenly-distributed). -- There are a large number of empty Regions or small Regions, which leads to a great difference in the number of Leaders in different Stores and high pressure on raftstore. This is the time for a [Region Merge](#the-speed-of-region-merge-is-slow) scheduling. +- There are a large number of empty Regions or small Regions, which leads to a great difference in the number of Leaders in different Stores and high pressure on Raftstore. This is the time for a [Region Merge](#region-merge-is-slow) scheduling. - Hardware and software environment varies among Stores. You can adjust the values of `leader-weight` and `region-weight` accordingly to control the distribution of Leader/Region. - Other unknown reasons. Still you can adjust the values of `leader-weight` and `region-weight` to control the distribution of Leader/Region. 
If there is a big difference in the rating of different Stores, you need to examine the Operator-related metrics, with special focus on the generation and execution of Operators. There are two main situations: -- When a Operator is generated but processes slow, it is possible that: +- When Operators are generated normally but the scheduling process is slow, it is possible that: - - The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `region-schedule-limit` to a larger value without significantly impacting application. In addition, the `max-pending-peer-count` and `max-snapshot-count` restrictions can also be properly adjusted. - - Other scheduling tasks are running concurrently and competing in the system, which slows down the balancing speed. In this case, if the balancing priors to other scheduling tasks, you can stop other tasks or limit their speed. For example, if you take some nodes offline when Regions are rebalancing, both operations consume the quota of `region-schedule-limit`. You can limit the speed of taking nodes offline, or simply set `disable-replace-offline-replica = true` to temporarily shut it down. - - The Operator processes too slow. You can check the time taken by Operator Steps to confirm. Generally, steps that do not involve sending and receiving snapshots (such as TransferLeader, RemovePeer, PromoteLearner, etc.) should be completed in milliseconds, while steps that involve snapshots (such as AddLearner, AddPeer, etc.) should be completed in tens of seconds. If the time taken is obviously too high, it is possible due to the excessive pressure of TiKV or the bottleneck of network, etc., which needs specific analysis. + - The scheduling speed is limited by default for load balancing purpose. You can adjust `leader-schedule-limit` or `region-schedule-limit` to larger values without significantly impacting regular services. 
In addition, you can also properly ease the restrictions specified by `max-pending-peer-count` and `max-snapshot-count`. + - Other scheduling tasks are running concurrently, which slows down the balancing. In this case, if the balancing takes precedence over other scheduling tasks, you can stop other tasks or limit their speeds. For example, if you take some nodes offline when balancing is in progress, both operations consume the quota of `region-schedule-limit`. In this case, you can limit the speed of scheduler to remove nodes, or simply set `disable-replace-offline-replica = true` to temporarily disable it. + - The scheduling process is too slow. You can check the **Operator Step duration** metric to confirm the cause. Generally, steps that do not involve sending and receiving snapshots (such as `TransferLeader`, `RemovePeer`, `PromoteLearner`) should be completed in milliseconds, while steps that involve snapshots (such as `AddLearner` and `AddPeer`) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or bottleneck in network, etc., which needs specific analysis. -- PD fails to generate the corresponding balancing task. Possible reasons include: +- PD fails to generate the corresponding balancing Scheduler. Possible reasons include: - - The Scheduler is not activated. For example, the corresponding Scheduler is deleted, or limit being set to 0. + - The Scheduler is not activated. For example, the corresponding Scheduler is deleted, or its limit is set to "0". - Other constraints. For example, `evict-leader-scheduler` in the system prevents Leaders from being migrating to the corresponding Store. Or Label property is set, which makes some Stores reject Leaders. - - The restrictions of cluster topology. For example, in a cluster of 3 replicas and 3 data centers, 3 replicas of each Region are distributed in different data centers due to replica isolation.
If the number of Stores of these data centers are different, the final scheduling reaches a balanced but globally unbalanced state in each data center. + - Restrictions from the cluster topology. For example, in a cluster of 3 replicas across 3 data centers, 3 replicas of each Region are distributed in different data centers due to replica isolation. If the number of Stores is different among these data centers, the scheduling can only reach a balanced state within each data center, but not balanced globally. -### The speed of taking nodes offline is slow +### Taking nodes offline is slow This scenario requires examining the generation and execution of Operators through related metrics. -When an Operator is successfully generated but processes slow, possible reasons are: +If Operators are successfully generated but the scheduling process is slow, possible reasons are: -- The schedule speed is limited by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to a larger value. Similarly, `max-pending-peer-count` and `max-snapshot-count` can also be properly enlarged. -- Other scheduling tasks are running concurrently and competing in the system. You can refer to the solution in [the previous section](#leaderregion-is-not-evenly-distributed). -- When you take a single node offline, a number of Region Leaders to be operated are concentrated on the offline node (about 1/3 under the configuration of 3 replicas), so the speed is limited by the speed at which this single node generates Snapshots. You can speed it up by manually adding an `evict-leader-scheduler` to migrate Leaders. +- The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to larger values. Similarly, you can consider loosening the limits on `max-pending-peer-count` and `max-snapshot-count`. +- Other scheduling tasks are running concurrently and racing for resources in the system.
You can refer to the solution in [the previous section](#leadersregions-are-not-evenly-distributed). +- When you take a single node offline, a number of Region Leaders to be processed (around 1/3 under the configuration of 3 replicas) are distributed on the node to remove. Therefore, the speed is limited by the speed at which snapshots are generated by this single node. You can speed it up by manually adding an `evict-leader-scheduler` to migrate Leaders. If the corresponding Operator fails to generate, possible reasons are: -- The Operator is stopped, or `replica-schedule-limit` is set to 0. -- There is no proper node to migrate Regions. For example, if the capacity of nodes that replace the nodes of same Label is larger than 80%, PD will stop scheduling to avoid running out of storage space. In such case, you need to add more nodes or delete some data to free space. +- The Operator is stopped, or `replica-schedule-limit` is set to "0". +- There is no proper node for Region migration. For example, if the available capacity size of the replacing nodes (of the same label) is less than 20%, PD will stop scheduling to avoid running out of storage space. In such case, you need to add more nodes or delete some data to free the space. -### The speed of putting nodes online is slow +### Bringing nodes online is slow -Currently, to take nodes online is scheduled through balance region mechanism, so you can refer to [Leader/Region is not evenly distributed](#leaderregion-is-not-evenly-distributed) for troubleshooting. +Currently, bringing nodes online is scheduled through the balance region mechanism. You can refer to [Leaders/Regions are not evenly distributed](#leadersregions-are-not-evenly-distributed) for troubleshooting. 
### Hot Regions are not evenly distributed -Hot Regions scheduling generally has the following problems: +Hot Regions scheduling issues generally fall into the following categories: -- There is a majority of hot Regions, but the scheduling speed cannot keep up with them to redistribute hot Regions in time. +- Hot Regions can be observed via PD metrics, but the scheduling speed cannot keep up to redistribute hot Regions in time. - **Solution**: adjust `hot-region-schedule-limit` to a larger value, and reduce the limit quota of other schedulers to speed up hot Regions scheduling. Or you can adjust `hot-region-cache-hits-threshold` to a smaller value to make PD sensitive to traffic changes. + **Solution**: adjust `hot-region-schedule-limit` to a larger value, and reduce the limit quota of other schedulers to speed up hot Regions scheduling. Or you can adjust `hot-region-cache-hits-threshold` to a smaller value to make PD more sensitive to traffic changes. -- A single Region with extensive traffics. For example, to scan a small table extensively is required in the production environment, which can also be detected from PD metrics. Since a single hotspot cannot be resolved by redistributing, you need to manually add a `split-region` Operator to redistribute such a Region. +- Hotspot formed on a single Region. For example, a small table is intensively scanned by a massive amount of requests. This can also be detected from PD metrics. Because you cannot actually distribute a single hotspot, you need to manually add a `split-region` Operator to split such a Region. -- The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis. So it is possible that PD fails to identify hotspots in certain scenarios. 
For example, some Regions have a large number of point-and-check requests, which are not significant in terms of traffic, but high QPS of which leads to bottlenecks in key modules. +- The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, when there are intensive point lookup requests for some Regions, it might not be obvious to detect in traffic, but still the high QPS might lead to bottlenecks in key modules. - **Solutions**: Firstly, locate the table with extensive traffic by examining operational needs, and add a `scatter-range-scheduler` to make all Regions of this table are evenly distributed. TiDB also provides an interface in its HTTP AIP to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. + **Solutions**: Firstly, locate the table where hot Regions are formed based on the specific business. Then add a `scatter-range-scheduler` scheduler to make all Regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. -### The speed of Region Merge is slow +### Region Merge is slow -Similar to the slow scheduling discussed earlier, the speed of Region Merge is most likely limited by default (`merge-schedule-limit` and `region-schedule-limit`), or Region Merge is competing with other schedulers. Specifically, the solutions are: +Similar to slow scheduling, the speed of Region Merge is most likely limited by the configurations of `merge-schedule-limit` and `region-schedule-limit`, or the Region Merge scheduler is competing with other schedulers.
Specifically, the solutions are: -- If it is known from statistics that there are a large number of empty Regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to a smaller value to speed up the merging. This is because merging involves replica migration, so the smaller the Region to be merged, the faster. If the generated Merge Operator is already has hundreds of opm, to further speed up the merging process, you can set `patrol-Region-interval` to `10ms`. This will make Region scanning faster but consume more CPU. +- If it is known from statistics that there are a large number of empty Regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to smaller values to speed up the merge. This is because the merge process involves replica migration, so the smaller the Region to be merged, the faster the merge is. If the merge operators are already generated rapidly, to further speed up the process, you can set `patrol-region-interval` to `10ms`. This makes Region scanning faster at the cost of more CPU consumption. - A lot of tables have been created and then emptied (including truncated tables). These empty Regions cannot be merged if the split table attribute is enabled. You can disable this attribute by adjusting the following parameters: @@ -262,8 +263,8 @@ Similar to the slow scheduling discussed earlier, the speed of Region Merge is m For v3.0.4 and v2.1.16 or earlier, the `approximate_keys` of Regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the number of keys break the constraints of `max-merge-region-keys`. To avoid this problem, you can adjust `max-merge-region-keys` to a larger value. 
-### TiKV node troubleshooting +### Troubleshoot TiKV node -If a TiKV node fails, after 30 minutes (customizable by configuration item `max-store-down-time`), PD defaults to setting the corresponding node to "Down" state, and rebalancing replicas for Regions involved. +If a TiKV node fails, PD defaults to setting the corresponding node to the **Down** state after 30 minutes (customizable by configuration item `max-store-down-time`), and rebalancing replicas for Regions involved. -Practically, if a node is deemed unrecoverable, you can immediately take it offline. This makes PD rebalance replicas soon and reduces the risk of data loss. In contrast, if a node is deemed recoverable, but might not be available in 30 minutes, you can temporarily adjust `max-store-down-time` to a larger value to avoid unnecessary replenishment of the replicas and resources waste after the timeout. \ No newline at end of file +Practically, if a node failure is considered unrecoverable, you can immediately take it offline. This makes PD replenish replicas soon in another node and reduces the risk of data loss. In contrast, if a node is considered recoverable, but the recovery cannot be done in 30 minutes, you can temporarily adjust `max-store-down-time` to a larger value to avoid unnecessary replenishment of the replicas and resources waste after the timeout. 
\ No newline at end of file From 433c58db1142be2c07ce39db96d69083ba91dfb6 Mon Sep 17 00:00:00 2001 From: anotherrachel Date: Mon, 9 Dec 2019 11:57:55 +0800 Subject: [PATCH 6/8] address comment, and modify capitalizaiton --- dev/glossary.md | 52 ++-- dev/reference/best-practices/pd-scheduling.md | 228 +++++++++--------- 2 files changed, 140 insertions(+), 140 deletions(-) diff --git a/dev/glossary.md b/dev/glossary.md index 65f03bdf950f6..928d038df78ae 100644 --- a/dev/glossary.md +++ b/dev/glossary.md @@ -8,60 +8,60 @@ category: glossary ## L -### Leader/Follower/Learner +### leader/follower/learner -Leader/Follower/Learner each corresponds to a role in a Raft group of [Peers](#regionpeerraft-group). The Leader services all client requests and replicates data to the Followers. If the group Leader fails, one of the Followers will be elected as the new Leader. Learners are non-voting Followers that only serves in the process of replica addition. +Leader/Follower/Learner each corresponds to a role in a Raft group of [peers](#regionpeerraft-group). The leader services all client requests and replicates data to the followers. If the group leader fails, one of the followers will be elected as the new leader. Learners are non-voting followers that only serve in the process of replica addition.
-### Operator Step +### Operator step -An Operator Step is a step in the execution of an Operator. An Operator normally contains multiple Operator steps. +An operator step is a step in the execution of an operator. An operator normally contains multiple operator steps. -Currently, available Steps generated by PD include: +Currently, available steps generated by PD include: -- `TransferLeader`: Transfers Leadership to a specified member -- `AddPeer`: Adds Peers to a specified Store -- `RemovePeer`: Removes a Peer of a Region -- `AddLearner`: Adds Learners to a specified Store -- `PromoteLearner`: Promotes a specified Learner to a voting member -- `SplitRegion`: Splits a specified Region into two +- `TransferLeader`: Transfers leadership to a specified member +- `AddPeer`: Adds peers to a specified store +- `RemovePeer`: Removes a peer of a region +- `AddLearner`: Adds learners to a specified store +- `PromoteLearner`: Promotes a specified learner to a voting member +- `SplitRegion`: Splits a specified region into two ## P -### `Pending`/`Down` +### pending/down -`Pending` and `Down` are two special states of a Peer. `Pending` indicates that the Raft log of Followers or Learners is vastly different from that of Leader. Followers in `Pending` cannot be elected as Leader. `Down` refers to a state that a Peer ceases to respond to Leader for a long time, which usually means the corresponding node is down or isolated from the network. +"Pending" and "down" are two special states of a peer. Pending indicates that the Raft log of followers or learners is vastly different from that of leader. Followers in pending cannot be elected as leader. "Down" refers to a state that a peer ceases to respond to leader for a long time, which usually means the corresponding node is down or isolated from the network. ## R -### Region/Peer/Raft Group +### region/peer/Raft group -Region is the minimal piece of data storage in TiKV, each representing a range of data (96 MiB by default).
Each Region has three replicas by default. A replica of a Region is called a Peer. Multiple Peers of the same Region replicate data via the Raft protocol, so Peers are also members of a Raft instance. TiKV uses Multi-Raft to manage data. That is, for each Region, there is a corresponding, isolated Raft Group. +Region is the minimal piece of data storage in TiKV, each representing a range of data (96 MiB by default). Each region has three replicas by default. A replica of a region is called a peer. Multiple peers of the same region replicate data via the Raft consensus algorithm, so peers are also members of a Raft instance. TiKV uses Multi-Raft to manage data. That is, for each region, there is a corresponding, isolated Raft group. -### Region Split +### region split -Regions are generated as data writes increase. The process of splitting is called Region Split. +Regions are generated as data writes increase. The process of splitting is called region split. -The mechanism of Region Split is to use one initial Region to cover the entire key space, and generate new Regions through splitting existing ones every time the size of the Region or the number of keys has reached a threshold. +The mechanism of region split is to use one initial region to cover the entire key space, and generate new regions through splitting existing ones every time the size of the region or the number of keys has reached a threshold. ## S -### Scheduler +### scheduler Schedulers are components in PD that generate scheduling tasks. Each scheduler in PD runs independently and serves different purposes. 
The commonly used schedulers are: -- `balance-leader-scheduler`: Balances the distribution of Leaders -- `balance-region-scheduler`: Balances the distribution of Peers -- `hot-region-scheduler`: Balances the distribution of hot Regions -- `evict-leader-{store-id}`: Evicts all Leaders of a node (often used for rolling upgrades) +- `balance-leader-scheduler`: Balances the distribution of leaders +- `balance-region-scheduler`: Balances the distribution of peers +- `hot-region-scheduler`: Balances the distribution of hot regions +- `evict-leader-{store-id}`: Evicts all leaders of a node (often used for rolling upgrades) ### Store -A Store refers to the storage node in the TiKV cluster (an instance of `tikv-server`). Each Store has a corresponding TiKV instance. \ No newline at end of file +A store refers to the storage node in the TiKV cluster (an instance of `tikv-server`). Each store has a corresponding TiKV instance. \ No newline at end of file diff --git a/dev/reference/best-practices/pd-scheduling.md b/dev/reference/best-practices/pd-scheduling.md index 929db88530151..a31a266a9a9e6 100644 --- a/dev/reference/best-practices/pd-scheduling.md +++ b/dev/reference/best-practices/pd-scheduling.md @@ -8,14 +8,14 @@ category: reference This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. 
This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: -- [Leader/Follower/Learner](/dev/glossary.md#leaderfollowerlearner) -- [Operator](/dev/glossary.md#operator) -- [Operator Step](/dev/glossary.md#operator-step) -- [Pending/Down](/dev/glossary.md#pendingdown) -- [Region/Peer/Raft Group](/dev/glossary.md#regionpeerraft-group) -- [Region Split](/dev/glossary.md#region-split) -- [Scheduler](/dev/glossary.md#scheduler) -- [Store](/dev/glossary.md#store) +- [leader/follower/learner](/dev/glossary.md#leaderfollowerlearner) +- [operator](/dev/glossary.md#operator) +- [operator step](/dev/glossary.md#operator-step) +- [pending/down](/dev/glossary.md#pendingdown) +- [region/peer/Raft group](/dev/glossary.md#regionpeerraft-group) +- [region split](/dev/glossary.md#region-split) +- [scheduler](/dev/glossary.md#scheduler) +- [store](/dev/glossary.md#store) > **Note:** > @@ -27,183 +27,183 @@ This section introduces the principles and processes involved in the scheduling ### Scheduling process -The Scheduling process generally has three steps: +The scheduling process generally has three steps: 1. Collect information Each TiKV node periodically reports two types of heartbeats to PD: - - `StoreHeartbeat`: Contains the overall information of Stores, including disk capacity, available storage, and read/write traffic. - - `RegionHeartbeat`: Contains the overall information of Regions, including the range of each Region, peer distribution, peer status, data volume, and read/write traffic. + - `StoreHeartbeat`: Contains the overall information of stores, including disk capacity, available storage, and read/write traffic + - `RegionHeartbeat`: Contains the overall information of regions, including the range of each region, peer distribution, peer status, data volume, and read/write traffic PD collects and restores this information for scheduling decisions. -2. Generate Operators +2. 
Generate operators - Different schedulers generate the Operators based on their own logic and requirements, with constraints such as: + Different schedulers generate the operators based on their own logic and requirements, with the following considerations: - - Do not add Peers to a Store in abnormal states (disconnected, down, busy, out of space) - - Do not balance Regions in abnormal states - - Do not transfer a Leader to a Pending Peer - - Do not remove a Leader directly - - Do not break the physical isolation of various Region Peers - - Do not violate constraints such as Label property + - Do not add peers to a store in abnormal states (disconnected, down, busy, low space) + - Do not balance regions in abnormal states + - Do not transfer a leader to a pending peer + - Do not remove a leader directly + - Do not break the physical isolation of various region peers + - Do not violate constraints such as label property -3. Execute Operators +3. Execute operators - To execute the Operators, the general procedure is: + To execute the operators, the general procedure is: - 1. The generated Operator first joins a queue managed by `OperatorController`. + 1. The generated operator first joins a queue managed by `OperatorController`. - 2. `OperatorController` takes the Operator out of the queue and executes it with a certain amount of concurrency based on the configuration. This step is to assign each Operator Step to the corresponding Region Leader. + 2. `OperatorController` takes the operator out of the queue and executes it with a certain amount of concurrency based on the configuration. This step is to assign each operator step to the corresponding region leader. - 3. The Operator is marked as "finish" or "timeout" and removed from the queue. + 3. The operator is marked as "finish" or "timeout" and removed from the queue. ### Load balancing -Region primarily relies on `balance-leader` and `balance-region` schedulers to achieve load balance. 
Both schedulers target distributing Regions evenly across all Stores in the cluster but with separate focuses: `balance-leader` deals with Region Leader to balance incoming client requests, whereas `balance-region` concerns itself with each Region Peer to redistribute the pressure of storage and avoid exceptions like out of storage space. +Region primarily relies on `balance-leader` and `balance-region` schedulers to achieve load balance. Both schedulers target distributing regions evenly across all stores in the cluster but with separate focuses: `balance-leader` deals with region leader to balance incoming client requests, whereas `balance-region` concerns itself with each region peer to redistribute the pressure of storage and avoid exceptions like out of storage space. `balance-leader` and `balance-region` share a similar scheduling process: -1. Rate Stores according to their resource availability. -2. `balance-leader` or `balance-region` constantly transfer Leaders or Peers from Stores with high scores to those with low scores. +1. Rate stores according to their resource availability. +2. `balance-leader` or `balance-region` constantly transfer leaders or peers from stores with high scores to those with low scores. -However, their rating methods are different. `balance-leader` uses the sum of all Region Sizes corresponding to Leaders in a Store, whereas the way of `balance-region` is relatively complicated. Depending on the specific storage capacity of each node, the rating method of `balance-region` might: +However, their rating methods are different. `balance-leader` uses the sum of all region sizes corresponding to leaders in a store, whereas the way of `balance-region` is relatively complicated. Depending on the specific storage capacity of each node, the rating method of `balance-region` might: - based on the amount of data when there is sufficient storage (to balance data distribution among nodes). 
- based on the available storage when there is insufficient storage (to balance the storage availability on different nodes). - based on the weighted sum of the two factors above when neither of the situations applies. -Because different nodes might differ in performance, you can also set the weight of load balancing for different Stores. `leader-weight` and `region-weight` are used to control the Leader weight and Region weight respectively ("1" by default for both). For example, when the `leader-weight` of a Store is set to "2", the number of Leaders on the node is about twice as many as that of other nodes after the scheduling stabilizes. Similarly, when the `leader-weight` of a Store is set to "0.5", the number of Leaders on the node is about half as many as that of other nodes. +Because different nodes might differ in performance, you can also set the weight of load balancing for different stores. `leader-weight` and `region-weight` are used to control the leader weight and region weight respectively ("1" by default for both). For example, when the `leader-weight` of a store is set to "2", the number of leaders on the node is about twice as many as that of other nodes after the scheduling stabilizes. Similarly, when the `leader-weight` of a store is set to "0.5", the number of leaders on the node is about half as many as that of other nodes. -### Hot Regions scheduling +### Hot regions scheduling -For hot Regions scheduling, use `hot-region-scheduler`. Currently in TiDB 3.0, the process is performed as follows: +For hot regions scheduling, use `hot-region-scheduler`. Currently in TiDB 3.0, the process is performed as follows: -1. Count hot Regions by determining read/write traffic that exceeds a certain threshold for a certain period based on the information reported by Stores. +1. Count hot regions by determining read/write traffic that exceeds a certain threshold for a certain period based on the information reported by stores. -2. 
Redistribute these Regions in a similar way to load balancing. +2. Redistribute these regions in a similar way to load balancing. -For hot write Regions, `hot-region-scheduler` attempts to redistribute both Region Peers and Leaders; for hot read Regions, `hot-region-scheduler` only redistributes Region Leaders. +For hot write regions, `hot-region-scheduler` attempts to redistribute both region peers and leaders; for hot read regions, `hot-region-scheduler` only redistributes region leaders. ### Cluster topology awareness -Cluster topology awareness enables PD to distribute replicas of a Region as much as possible. This is how TiKV ensures high availability and disaster recovery capability. PD continuously scans all Regions in the background. When PD finds that the distribution of Regions is not optimal, it generates an Operator to replace Peers and redistribute Regions. +Cluster topology awareness enables PD to distribute replicas of a region as much as possible. This is how TiKV ensures high availability and disaster recovery capability. PD continuously scans all regions in the background. When PD finds that the distribution of regions is not optimal, it generates an operator to replace peers and redistribute regions. -The component to check Region distribution is `replicaChecker`, which is similar to Scheduler except that it cannot be disabled. `replicaChecker` schedules based on the the configuration of `location-labels`. For example, `[zone, rack, host]` defines a three-tier topology for a cluster. PD attempts to schedule Region Peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on. +The component to check region distribution is `replicaChecker`, which is similar to a scheduler except that it cannot be disabled. `replicaChecker` schedules based on the configuration of `location-labels`.
For example, `[zone,rack,host]` defines a three-tier topology for a cluster. PD attempts to schedule region peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on. ### Scale-down and failure recovery -Scale-down refers to the process when you take a Store offline and mark it as "offline" using a command. PD replicates the Regions on the offline node to other nodes by scheduling. Failure recovery applies when Stores failed and cannot be recovered. In this case, Regions with Peers distributed on the corresponding Store might lose replicas, which requires PD to replenish on other nodes. +Scale-down refers to the process when you take a store offline and mark it as "offline" using a command. PD replicates the regions on the offline node to other nodes by scheduling. Failure recovery applies when stores failed and cannot be recovered. In this case, regions with peers distributed on the corresponding store might lose replicas, which requires PD to replenish on other nodes. -The processes of Scale-down and failure recovery are basically the same. `replicaChecker` finds a Region Peer in abnormal states, and then generates an Operator to replace the abnormal Peer with a new one on a healthy Store. +The processes of scale-down and failure recovery are basically the same. `replicaChecker` finds a region peer in abnormal states, and then generates an operator to replace the abnormal peer with a new one on a healthy store. -### Region Merge +### Region merge -Region Merge refers to the process of merging adjacent small Regions. It serves to avoid unnecessary resource consumption by a large number of small or even empty Regions after data deletion. 
Region Merge is performed by `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all Regions in the background, and generates an Operator when contiguous small Regions are found. +Region merge refers to the process of merging adjacent small regions. It serves to avoid unnecessary resource consumption by a large number of small or even empty regions after data deletion. Region merge is performed by `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all regions in the background, and generates an operator when contiguous small regions are found. ## Query scheduling status -You can check the status of scheduling system through Metrics, pd-ctl and logs. This section briefly introduces the methods of Metrics and pd-ctl. Refer to [PD monitoring metrics](/dev/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control](/dev/reference/tools/pd-control.md) for details. +You can check the status of scheduling system through metrics, pd-ctl and logs. This section briefly introduces the methods of metrics and pd-ctl. Refer to [PD Monitoring Metrics](/dev/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control](/dev/reference/tools/pd-control.md) for details. 
### Operator status -The **Grafana PD/Operator** page shows the statistics about Operators, among which: +The **Grafana PD/Operator** page shows the metrics about operators, among which: -- Schedule Operator Create: Operator creating information -- Operator finish duration: Execution time consumed by each Operator -- Operator Step duration: Execution time consumed by the Operator Step +- Schedule operator create: Operator creating information +- Operator finish duration: Execution time consumed by each operator +- Operator step duration: Execution time consumed by the operator step -You can query Operators using pd-ctl with the following commands: +You can query operators using pd-ctl with the following commands: -- `operator show`: Queries all Operators generated in the current scheduling task -- `operator show [admin | leader | region]`: Queries Operators by type +- `operator show`: Queries all operators generated in the current scheduling task +- `operator show [admin | leader | region]`: Queries operators by type ### Balance status -The **Grafana PD/Statistics - Balance** page shows the statistics about load balancing, among which: +The **Grafana PD/Statistics - Balance** page shows the metrics about load balancing, among which: -- Store Leader/Region score: Score of each Store -- Store Leader/Region count: The number of Leaders/Regions in each Store -- Store available: Available storage on each Store +- Store leader/region score: Score of each store +- Store leader/region count: The number of leaders/regions in each store +- Store available: Available storage on each store -You can use store commands of pd-ctl to query balance status of each Store. +You can use store commands of pd-ctl to query balance status of each store. 
### Hot Region status -The **Grafana PD/Statistics - hotspot** page shows the statistics about hot Regions, among which: +The **Grafana PD/Statistics - hotspot** page shows the metrics about hot regions, among which: -- Hot write Region’s leader/peer distribution: Leader/Peer distribution in hot write Regions -- Hot read Region’s leader distribution: Leader distribution in hot read Regions +- Hot write region’s leader/peer distribution: the leader/peer distribution in hot write regions +- Hot read region’s leader distribution: the leader distribution in hot read regions -You can also query the status of hot Regions using pd-ctl with the following commands: +You can also query the status of hot regions using pd-ctl with the following commands: -- `hot read`: Queries hot read Regions -- `hot write`: Queries hot write Regions -- `hot store`: Queries the distribution of hot Regions by Store -- `region topread [limit]`: Queries the Region with top read traffic -- `region topwrite [limit]`: Queries the Region with top write traffic +- `hot read`: Queries hot read regions +- `hot write`: Queries hot write regions +- `hot store`: Queries the distribution of hot regions by store +- `region topread [limit]`: Queries the region with top read traffic +- `region topwrite [limit]`: Queries the region with top write traffic ### Region health -The **Grafana PD/Cluster/Region health** panel shows the statistics about Regions in abnormal states, include Pending Peer, Down Peer, Offline Peer and Regions with extra or few Peers. +The **Grafana PD/Cluster/Region health** panel shows the metrics about regions in abnormal states. 
-You can query the list of Regions in abnormal states using pd-ctl with region check commands: +You can query the list of regions in abnormal states using pd-ctl with region check commands: -- `region check miss-peer`: Queries Regions without enough Peers -- `region check extra-peer`: Queries Regions with extra Peers -- `region check down-peer`: Queries Regions with Down Peers -- `region check pending-peer`: Queries Regions with Pending Peers +- `region check miss-peer`: Queries regions without enough peers +- `region check extra-peer`: Queries regions with extra peers +- `region check down-peer`: Queries regions with down peers +- `region check pending-peer`: Queries regions with pending peers ## Control scheduling strategy You can use pd-ctl to adjust the scheduling strategy from the following three aspects. Refer to [PD Control](/dev/reference/tools/pd-control.md) for more details. -### Add/delete Scheduler manually +### Add/delete scheduler manually -PD supports dynamically adding and removing Schedulers directly through pd-ctl. For example: +PD supports dynamically adding and removing schedulers directly through pd-ctl. For example: -- `scheduler show`: Shows currently running Schedulers in the system +- `scheduler show`: Shows currently running schedulers in the system - `scheduler remove balance-leader-scheduler`: Removes (disable) balance-leader-scheduler -- `scheduler add evict-leader-scheduler-1`: Adds a scheduler to remove all Leaders in Store 1 +- `scheduler add evict-leader-scheduler-1`: Adds a scheduler to remove all leaders in Store 1 ### Add/delete Operators manually PD also supports adding or removing Operators directly through pd-ctl. 
For example: -- `operator add add-peer 2 5`: Adds Peers to Region 2 in Store 5 -- `operator add transfer-leader 2 5`: Migrates the Leader of Region 2 to Store 5 -- `operator add split-region 2`: Splits Region 2 into two Regions evenly in size -- `operator remove 2`: Removes currently pending Operator in Region 2 +- `operator add add-peer 2 5`: Adds peers to Region 2 in Store 5 +- `operator add transfer-leader 2 5`: Migrates the leader of Region 2 to Store 5 +- `operator add split-region 2`: Splits Region 2 into two regions evenly in size +- `operator remove 2`: Removes currently pending operator in Region 2 ### Adjust scheduling parameter You can check the scheduling configuration using the `config show` command in pd-ctl, and adjust the values using `config set {key} {value}`. Common adjustments include: -- `leader-schedule-limit`: Controls the concurrency of Transfer Leader scheduling -- `region-schedule-limit`: Controls the concurrency of adding/deleting Peer scheduling +- `leader-schedule-limit`: Controls the concurrency of transferring leader scheduling +- `region-schedule-limit`: Controls the concurrency of adding/deleting peer scheduling - `disable-replace-offline-replica`: Determines whether to disable the scheduling to take nodes offline -- `disable-location-replacement`: Determines whether to disable the scheduling that handles the isolation level of Regions -- `max-snapshot-count`: Controls the maximum concurrency of sending/receiving Snapshots for each Store +- `disable-location-replacement`: Determines whether to disable the scheduling that handles the isolation level of regions +- `max-snapshot-count`: Controls the maximum concurrency of sending/receiving snapshots for each store ## PD scheduling in common scenarios This section illustrates the best practices of PD scheduling strategies through several typical scenarios. 
-### Leaders/Regions are not evenly distributed +### Leaders/regions are not evenly distributed -The rating mechanism of PD determines that Leader Count and Region Count of different Stores cannot fully reflect the load balancing status. Therefore, it is necessary to confirm whether there is load imbalancing from the actual load of TiKV or Storage usage. +The rating mechanism of PD determines that leader count and region count of different stores cannot fully reflect the load balancing status. Therefore, it is necessary to confirm whether there is load imbalance from the actual load of TiKV or storage usage. -Once you have confirmed that Leaders/Regiosn are not evenly distributed, you need to check the rating of different Stores. +Once you have confirmed that leaders/regions are not evenly distributed, you need to check the rating of different stores. -If the scores of different Stores are close, it means PD mistakenly believes that Leaders/Regions are evenly distributed. Possible reasons are: +If the scores of different stores are close, it means PD mistakenly believes that leaders/regions are evenly distributed. Possible reasons are: -- There are hot Regions that cause load imbalancing. In this case, you need to analyze further based on [hot Regions scheduling](#hot-regions-are-not-evenly-distributed). -- There are a large number of empty Regions or small Regions, which leads to a great difference in the number of Leaders in different Stores and high pressure on Raftstore. This is the time for a [Region Merge](#region-merge-is-slow) scheduling. -- Hardware and software environment varies among Stores. You can adjust the values of `leader-weight` and `region-weight` accordingly to control the distribution of Leader/Region. -- Other unknown reasons. Still you can adjust the values of `leader-weight` and `region-weight` to control the distribution of Leader/Region. +- There are hot regions that cause load imbalance.
In this case, you need to analyze further based on [hot regions scheduling](#hot-regions-are-not-evenly-distributed). +- There are a large number of empty regions or small regions, which leads to a great difference in the number of leaders in different stores and high pressure on Raft store. This is the time for a [region merge](#region-merge-is-slow) scheduling. +- Hardware and software environment varies among stores. You can adjust the values of `leader-weight` and `region-weight` accordingly to control the distribution of leader/region. +- Other unknown reasons. Still you can adjust the values of `leader-weight` and `region-weight` to control the distribution of leader/region. -If there is a big difference in the rating of different Stores, you need to examine the Operator-related metrics, with special focus on the generation and execution of Operators. There are two main situations: +If there is a big difference in the rating of different stores, you need to examine the operator-related metrics, with special focus on the generation and execution of operators. There are two main situations: - When Operators are generated normally but the scheduling process is slow, it is possible that: @@ -211,60 +211,60 @@ If there is a big difference in the rating of different Stores, you need to exam - Other scheduling tasks are running concurrently, which slows down the balancing. In this case, if the balancing takes precedence over other scheduling tasks, you can stop other tasks or limit their speeds. For example, if you take some nodes offline when balancing is in progress, both operations consume the quota of `region-schedule-limit`. In this case, you can limit the speed of scheduler to remove nodes, or simply set `disable-replace-offline-replica = true` to temporarily disable it. - The scheduling process is too slow. You can check the **Operator Step duration** metric to confirm the cause. 
Generally, steps that do not involve sending and receiving snapshots (such as `TransferLeader`, `RemovePeer`, `PromoteLearner`) should be completed in milliseconds, while steps that involve snapshots (such as `AddLearner` and `AddPeer`) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or bottleneck in network, etc., which needs specific analysis. -- PD fails to generate the corresponding balancing Scheduler. Possible reasons include: +- PD fails to generate the corresponding balancing scheduler. Possible reasons include: - - The Scheduler is not activated. For example, the corresponding Scheduler is deleted, or its limit it set to "0". - - Other constraints. For example, `evict-leader-scheduler` in the system prevents Leaders from being migrating to the corresponding Store. Or Label property is set, which makes some Stores reject Leaders. - - Restrictions from the cluster topology. For example, in a cluster of 3 replicas across 3 data centers, 3 replicas of each Region are distributed in different data centers due to replica isolation. If the number of Stores is different among these data centers, the scheduling can only reach a balanced state within each data center, but not balanced globally. + - The scheduler is not activated. For example, the corresponding scheduler is deleted, or its limit is set to "0". + - Other constraints. For example, `evict-leader-scheduler` in the system prevents leaders from being migrated to the corresponding store. Or label property is set, which makes some stores reject leaders. + - Restrictions from the cluster topology. For example, in a cluster of 3 replicas across 3 data centers, 3 replicas of each region are distributed in different data centers due to replica isolation. If the number of stores is different among these data centers, the scheduling can only reach a balanced state within each data center, but not balanced globally.
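To get a feel for how adjusting `leader-weight` shifts the leader distribution, here is a minimal Python sketch. It is not PD's scoring code: it assumes, as a simplification, that once scheduling stabilizes each store holds a share of leaders proportional to its weight, and the function name `expected_leaders` is hypothetical.

```python
def expected_leaders(total_leaders, weights):
    """Split total_leaders across stores in proportion to their leader-weight.

    Simplifying assumption (not taken from PD's source): after scheduling
    stabilizes, leader_count / leader_weight is roughly equal on every store.
    Real PD scores are based on region sizes, not plain counts.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one store needs a positive weight")
    return {store: total_leaders * w / total_weight for store, w in weights.items()}


# Three stores: store 1 has leader-weight 2, stores 2 and 3 keep the default 1.
shares = expected_leaders(1200, {1: 2.0, 2: 1.0, 3: 1.0})
print(shares)  # {1: 600.0, 2: 300.0, 3: 300.0}
```

Under this assumption, the store with weight 2 settles at about twice as many leaders as each of the others, which is the kind of shift to expect when compensating for uneven hardware.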
### Taking nodes offline is slow -This scenario requires examining the generation and execution of Operators through related metrics. +This scenario requires examining the generation and execution of operators through related metrics. -If Operators are successfully generated but the scheduling process is slow, possible reasons are: +If operators are successfully generated but the scheduling process is slow, possible reasons are: - The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to larger values. Similarly, you can consider loosening the limits on `max-pending-peer-count` and `max-snapshot-count`. -- Other scheduling tasks are running concurrently and racing for resources in the system. You can refer to the solution in [the previous section](#leadersregions-are-not-evenly-distributed). -- When you take a single node offline, a number of Region Leaders to be processed (around 1/3 under the configuration of 3 replicas) are distributed on the node to remove. Therefore, the speed is limited by the speed at which snapshots are generated by this single node. You can speed it up by manually adding an `evict-leader-scheduler` to migrate Leaders. +- Other scheduling tasks are running concurrently and racing for resources in the system. You can refer to the solution in [Leaders/regions are not evenly distributed](#leadersregions-are-not-evenly-distributed). +- When you take a single node offline, a number of region leaders to be processed (around 1/3 under the configuration of 3 replicas) are distributed on the node to remove. Therefore, the speed is limited by the speed at which snapshots are generated by this single node. You can speed it up by manually adding an `evict-leader-scheduler` to migrate leaders.
-If the corresponding Operator fails to generate, possible reasons are: +If the corresponding operator fails to generate, possible reasons are: -- The Operator is stopped, or `replica-schedule-limit` is set to "0". -- There is no proper node for Region migration. For example, if the available capacity size of the replacing nodes (of the same label) is less than 20%, PD will stop scheduling to avoid running out of storage space. In such case, you need to add more nodes or delete some data to free the space. +- The operator is stopped, or `replica-schedule-limit` is set to "0". +- There is no proper node for region migration. For example, if the available capacity size of the replacing node (of the same label) is less than 20%, PD will stop scheduling to avoid running out of storage on that node. In such case, you need to add more nodes or delete some data to free the space. ### Bringing nodes online is slow -Currently, bringing nodes online is scheduled through the balance region mechanism. You can refer to [Leaders/Regions are not evenly distributed](#leadersregions-are-not-evenly-distributed) for troubleshooting. +Currently, bringing nodes online is scheduled through the balance region mechanism. You can refer to [Leaders/regions are not evenly distributed](#leadersregions-are-not-evenly-distributed) for troubleshooting. -### Hot Regions are not evenly distributed +### Hot regions are not evenly distributed -Hot Regions scheduling issues generally fall into the following categories: +Hot regions scheduling issues generally fall into the following categories: -- Hot Regions can be observed via PD metrics, but the scheduling speed cannot keep up to redistribute hot Regions in time. +- Hot regions can be observed via PD metrics, but the scheduling speed cannot keep up to redistribute hot regions in time. - **Solution**: adjust `hot-region-schedule-limit` to a larger value, and reduce the limit quota of other schedulers to speed up hot Regions scheduling. 
Or you can adjust `hot-region-cache-hits-threshold` to a smaller value to make PD more sensitive to traffic changes. + **Solution**: adjust `hot-region-schedule-limit` to a larger value, and reduce the limit quota of other schedulers to speed up hot regions scheduling. Or you can adjust `hot-region-cache-hits-threshold` to a smaller value to make PD more sensitive to traffic changes. -- Hotspot formed on a single Region. For example, a small table is intensively scanned by a massive amount of requests. This can also be detected from PD metrics. Because you cannot actually distribute a single hotspot, you need to manually add a `split-region` Operator to split such a Region. +- Hotspot formed on a single region. For example, a small table is intensively scanned by a massive amount of requests. This can also be detected from PD metrics. Because you cannot actually distribute a single hotspot, you need to manually add a `split-region` operator to split such a region. -- The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, when there are intensive point lookup requests for some Regions, it might not be obvious to detect in traffic, but still the high QPS might lead to bottlenecks in key modules. +- The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, when there are intensive point lookup requests for some regions, it might not be obvious to detect in traffic, but still the high QPS might lead to bottlenecks in key modules. 
- **Solutions**: Firstly, locate the table where hot Regions are formed based on the specific business. Then add a `scatter-range-scheduler` scheduler to make all Regions of this table evenly distributed. TiDB also provides an interface in its HTTP AIP to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. + **Solutions**: Firstly, locate the table where hot regions are formed based on the specific business. Then add a `scatter-range-scheduler` scheduler to make all Regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. -### Region Merge is slow +### Region merge is slow -Similar to slow scheduling, the speed of Region Merge is most likely limited by the configurations of `merge-schedule-limit` and `region-schedule-limit`, or the Region Merge scheduler is competing with other schedulers. Specifically, the solutions are: +Similar to slow scheduling, the speed of region merge is most likely limited by the configurations of `merge-schedule-limit` and `region-schedule-limit`, or the region merge scheduler is competing with other schedulers. Specifically, the solutions are: -- If it is known from statistics that there are a large number of empty Regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to smaller values to speed up the merge. This is because the merge process involves replica migration, so the smaller the Region to be merged, the faster the merge is. If the merge operators are already generated rapidly, to further speed up the process, you can set `patrol-region-interval` to `10ms`. This makes Region scanning faster at the cost of more CPU consumption. 
+- If it is known from metrics that there are a large number of empty regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to smaller values to speed up the merge. This is because the merge process involves replica migration, so the smaller the region to be merged, the faster the merge is. If the merge operators are already generated rapidly, to further speed up the process, you can set `patrol-region-interval` to `10ms`. This makes region scanning faster at the cost of more CPU consumption. -- A lot of tables have been created and then emptied (including truncated tables). These empty Regions cannot be merged if the split table attribute is enabled. You can disable this attribute by adjusting the following parameters: +- A lot of tables have been created and then emptied (including truncated tables). These empty regions cannot be merged if the split table attribute is enabled. You can disable this attribute by adjusting the following parameters: - TiKV: set `split-region-on-table` to `false` - PD: set `namespace-classifier` to "" -For v3.0.4 and v2.1.16 or earlier, the `approximate_keys` of Regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the number of keys break the constraints of `max-merge-region-keys`. To avoid this problem, you can adjust `max-merge-region-keys` to a larger value. +For v3.0.4 and v2.1.16 or earlier, the `approximate_keys` of regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the number of keys break the constraints of `max-merge-region-keys`. To avoid this problem, you can adjust `max-merge-region-keys` to a larger value. ### Troubleshoot TiKV node -If a TiKV node fails, PD defaults to setting the corresponding node to the **Down** state after 30 minutes (customizable by configuration item `max-store-down-time`), and rebalancing replicas for Regions involved. 
+If a TiKV node fails, PD defaults to setting the corresponding node to the **down** state after 30 minutes (customizable by configuration item `max-store-down-time`), and rebalancing replicas for regions involved. Practically, if a node failure is considered unrecoverable, you can immediately take it offline. This makes PD replenish replicas soon in another node and reduces the risk of data loss. In contrast, if a node is considered recoverable, but the recovery cannot be done in 30 minutes, you can temporarily adjust `max-store-down-time` to a larger value to avoid unnecessary replenishment of the replicas and resources waste after the timeout. \ No newline at end of file From e53e799f1c052af5ca30b9ca7dfbe30c7cc5a00a Mon Sep 17 00:00:00 2001 From: anotherrachel Date: Mon, 9 Dec 2019 14:28:25 +0800 Subject: [PATCH 7/8] update v2.1, v3.0 and v3.1 --- dev/glossary.md | 2 +- dev/reference/best-practices/pd-scheduling.md | 8 +- v2.1/glossary.md | 67 +++++ .../reference/best-practices/pd-scheduling.md | 270 ++++++++++++++++++ v3.0/glossary.md | 67 +++++ .../reference/best-practices/pd-scheduling.md | 270 ++++++++++++++++++ v3.1/glossary.md | 67 +++++ .../reference/best-practices/pd-scheduling.md | 270 ++++++++++++++++++ 8 files changed, 1016 insertions(+), 5 deletions(-) create mode 100644 v2.1/glossary.md create mode 100644 v2.1/reference/best-practices/pd-scheduling.md create mode 100644 v3.0/glossary.md create mode 100644 v3.0/reference/best-practices/pd-scheduling.md create mode 100644 v3.1/glossary.md create mode 100644 v3.1/reference/best-practices/pd-scheduling.md diff --git a/dev/glossary.md b/dev/glossary.md index 928d038df78ae..dd1d60b829501 100644 --- a/dev/glossary.md +++ b/dev/glossary.md @@ -22,7 +22,7 @@ An operator can be computed and generated by a [scheduler](#scheduler), or creat ### Operator step -An Operator step is a step in the execution of an Operator. An operator normally contains multiple Operator steps. 
+An operator step is a step in the execution of an operator. An operator normally contains multiple operator steps. Currently, available steps generated by PD include: diff --git a/dev/reference/best-practices/pd-scheduling.md b/dev/reference/best-practices/pd-scheduling.md index a31a266a9a9e6..1089203f10227 100644 --- a/dev/reference/best-practices/pd-scheduling.md +++ b/dev/reference/best-practices/pd-scheduling.md @@ -169,7 +169,7 @@ PD supports dynamically adding and removing schedulers directly through pd-ctl. ### Add/delete Operators manually -PD also supports adding or removing Operators directly through pd-ctl. For example: +PD also supports adding or removing operators directly through pd-ctl. For example: - `operator add add-peer 2 5`: Adds peers to Region 2 in Store 5 - `operator add transfer-leader 2 5`: Migrates the leader of Region 2 to Store 5 @@ -205,11 +205,11 @@ If the scores of different stores are close, it means PD mistakenly believes tha If there is a big difference in the rating of different stores, you need to examine the operator-related metrics, with special focus on the generation and execution of operators. There are two main situations: -- When Operators are generated normally but the scheduling process is slow, it is possible that: +- When operators are generated normally but the scheduling process is slow, it is possible that: - The scheduling speed is limited by default for load balancing purpose. You can adjust `leader-schedule-limit` or `region-schedule-limit` to larger values without significantly impacting regular services. In addition, you can also properly ease the restrictions specified by `max-pending-peer-count` and `max-snapshot-count`. - Other scheduling tasks are running concurrently, which slows down the balancing. In this case, if the balancing takes precedence over other scheduling tasks, you can stop other tasks or limit their speeds.
For example, if you take some nodes offline when balancing is in progress, both operations consume the quota of `region-schedule-limit`. In this case, you can limit the speed of scheduler to remove nodes, or simply set `disable-replace-offline-replica = true` to temporarily disable it. - - The scheduling process is too slow. You can check the **Operator Step duration** metric to confirm the cause. Generally, steps that do not involve sending and receiving snapshots (such as `TransferLeader`, `RemovePeer`, `PromoteLearner`) should be completed in milliseconds, while steps that involve snapshots (such as `AddLearner` and `AddPeer`) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or bottleneck in network, etc., which needs specific analysis. + - The scheduling process is too slow. You can check the **Operator step duration** metric to confirm the cause. Generally, steps that do not involve sending and receiving snapshots (such as `TransferLeader`, `RemovePeer`, `PromoteLearner`) should be completed in milliseconds, while steps that involve snapshots (such as `AddLearner` and `AddPeer`) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or bottleneck in network, etc., which needs specific analysis. - PD fails to generate the corresponding balancing scheduler. Possible reasons include: @@ -248,7 +248,7 @@ Hot regions scheduling issues generally fall into the following categories: - The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. 
For example, when there are intensive point lookup requests for some regions, it might not be obvious to detect in traffic, but still the high QPS might lead to bottlenecks in key modules. - **Solutions**: Firstly, locate the table where hot regions are formed based on the specific business. Then add a `scatter-range-scheduler` scheduler to make all Regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. + **Solutions**: Firstly, locate the table where hot regions are formed based on the specific business. Then add a `scatter-range-scheduler` scheduler to make all regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. ### Region merge is slow diff --git a/v2.1/glossary.md b/v2.1/glossary.md new file mode 100644 index 0000000000000..dd1d60b829501 --- /dev/null +++ b/v2.1/glossary.md @@ -0,0 +1,67 @@ +--- +title: Glossary +summary: Glossaries about TiDB. +category: glossary +--- + +# Glossary + +## L + +### leader/follower/learner + +Leader/Follower/Learner each corresponds to a role in a Raft group of [peers](#regionpeerraft-group). The leader services all client requests and replicates data to the followers. If the group leader fails, one of the followers will be elected as the new leader. Learners are non-voting followers that only serve in the process of replica addition. + +## O + +### Operator + +An operator is a collection of actions that applies to a region for scheduling purposes. Operators perform scheduling tasks such as "migrate the leader of Region 2 to Store 5" and "migrate replicas of Region 2 to Store 1, 4, 5".
+ +An operator can be computed and generated by a [scheduler](#scheduler), or created by an external API. + +### Operator step + +An operator step is a step in the execution of an operator. An operator normally contains multiple operator steps. + +Currently, available steps generated by PD include: + +- `TransferLeader`: Transfers leadership to a specified member +- `AddPeer`: Adds peers to a specified store +- `RemovePeer`: Removes a peer of a region +- `AddLearner`: Adds learners to a specified store +- `PromoteLearner`: Promotes a specified learner to a voting member +- `SplitRegion`: Splits a specified region into two + +## P + +### pending/down + +"Pending" and "down" are two special states of a peer. "Pending" indicates that the Raft log of followers or learners is vastly different from that of the leader. Followers in the pending state cannot be elected as leader. "Down" refers to a state in which a peer ceases to respond to the leader for a long time, which usually means the corresponding node is down or isolated from the network. + +## R + +### region/peer/Raft group + +Region is the minimal piece of data storage in TiKV, each representing a range of data (96 MiB by default). Each region has three replicas by default. A replica of a region is called a peer. Multiple peers of the same region replicate data via the Raft consensus algorithm, so peers are also members of a Raft instance. TiKV uses Multi-Raft to manage data. That is, for each region, there is a corresponding, isolated Raft group. + +### region split + +Regions are generated as data writes increase. The process of splitting is called region split. + +The mechanism of region split is to use one initial region to cover the entire key space, and generate new regions through splitting existing ones every time the size of the region or the number of keys has reached a threshold. + +## S + +### scheduler + +Schedulers are components in PD that generate scheduling tasks.
Each scheduler in PD runs independently and serves different purposes. The commonly used schedulers are: + +- `balance-leader-scheduler`: Balances the distribution of leaders +- `balance-region-scheduler`: Balances the distribution of peers +- `hot-region-scheduler`: Balances the distribution of hot regions +- `evict-leader-{store-id}`: Evicts all leaders of a node (often used for rolling upgrades) + +### Store + +A store refers to the storage node in the TiKV cluster (an instance of `tikv-server`). Each store has a corresponding TiKV instance. \ No newline at end of file diff --git a/v2.1/reference/best-practices/pd-scheduling.md b/v2.1/reference/best-practices/pd-scheduling.md new file mode 100644 index 0000000000000..e93ea90b45bb5 --- /dev/null +++ b/v2.1/reference/best-practices/pd-scheduling.md @@ -0,0 +1,270 @@ +--- +title: PD Scheduling +summary: Learn best practice and strategy for PD scheduling. +category: reference +--- + +# PD Scheduling + +This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: + +- [leader/follower/learner](/v2.1/glossary.md#leaderfollowerlearner) +- [operator](/v2.1/glossary.md#operator) +- [operator step](/v2.1/glossary.md#operator-step) +- [pending/down](/v2.1/glossary.md#pendingdown) +- [region/peer/Raft group](/v2.1/glossary.md#regionpeerraft-group) +- [region split](/v2.1/glossary.md#region-split) +- [scheduler](/v2.1/glossary.md#scheduler) +- [store](/v2.1/glossary.md#store) + +> **Note:** +> +> This document initially targets TiDB 3.0. Although some features are not supported in earlier versions (2.x), the underlying mechanisms are similar and this document can still be used as a reference. + +## PD scheduling policies + +This section introduces the principles and processes involved in the scheduling system. 
+ +### Scheduling process + +The scheduling process generally has three steps: + +1. Collect information + + Each TiKV node periodically reports two types of heartbeats to PD: + + - `StoreHeartbeat`: Contains the overall information of stores, including disk capacity, available storage, and read/write traffic + - `RegionHeartbeat`: Contains the overall information of regions, including the range of each region, peer distribution, peer status, data volume, and read/write traffic + + PD collects and stores this information for scheduling decisions. + +2. Generate operators + + Different schedulers generate operators based on their own logic and requirements, with the following considerations: + + - Do not add peers to a store in an abnormal state (disconnected, down, busy, low space) + - Do not balance regions in abnormal states + - Do not transfer a leader to a pending peer + - Do not remove a leader directly + - Do not break the physical isolation of various region peers + - Do not violate constraints such as label property + +3. Execute operators + + To execute the operators, the general procedure is: + + 1. The generated operator first joins a queue managed by `OperatorController`. + + 2. `OperatorController` takes the operator out of the queue and executes it with a certain amount of concurrency based on the configuration. This step assigns each operator step to the corresponding region leader. + + 3. The operator is marked as "finish" or "timeout" and removed from the queue. + +### Load balancing + +Load balancing of regions primarily relies on the `balance-leader` and `balance-region` schedulers. Both schedulers aim to distribute regions evenly across all stores in the cluster, but with separate focuses: `balance-leader` deals with region leaders to balance incoming client requests, whereas `balance-region` deals with region peers to spread storage pressure and avoid problems such as running out of storage space.
+ +`balance-leader` and `balance-region` share a similar scheduling process: + +1. Rate stores according to their resource availability. +2. `balance-leader` or `balance-region` constantly transfers leaders or peers from stores with high scores to those with low scores. + +However, their rating methods are different. `balance-leader` uses the sum of all region sizes corresponding to leaders in a store, whereas the rating method of `balance-region` is more complicated. Depending on the specific storage capacity of each node, the rating of `balance-region` might be: + +- Based on the amount of data when there is sufficient storage (to balance data distribution among nodes). +- Based on the available storage when there is insufficient storage (to balance the storage availability on different nodes). +- Based on the weighted sum of the two factors above when neither of the situations applies. + +Because different nodes might differ in performance, you can also set the weight of load balancing for different stores. `leader-weight` and `region-weight` are used to control the leader weight and region weight respectively ("1" by default for both). For example, when the `leader-weight` of a store is set to "2", the number of leaders on the node is about twice that of other nodes after the scheduling stabilizes. Similarly, when the `leader-weight` of a store is set to "0.5", the number of leaders on the node is about half that of other nodes. + +### Hot regions scheduling + +For hot regions scheduling, use `hot-region-scheduler`. Currently in TiDB 3.0, the process is performed as follows: + +1. Count hot regions by determining read/write traffic that exceeds a certain threshold for a certain period based on the information reported by stores. + +2. Redistribute these regions in a similar way to load balancing.
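The counting step above can be illustrated with a minimal Python sketch. It is not PD's implementation: the threshold values, the report format (one dict of region ID to traffic per reporting round), and the function name are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical values for illustration only; PD's real thresholds differ.
HOT_BYTES_THRESHOLD = 1_000_000  # traffic per report that counts as "hot"
HOT_HITS_THRESHOLD = 3           # consecutive hot reports required


def find_hot_regions(reports, bytes_threshold=HOT_BYTES_THRESHOLD,
                     hits_threshold=HOT_HITS_THRESHOLD):
    """Return IDs of regions whose traffic stayed at or above the threshold
    for at least `hits_threshold` consecutive reports."""
    streak = defaultdict(int)  # current run of hot reports per region
    hot = set()
    for report in reports:  # one dict {region_id: traffic} per round
        for region_id, traffic in report.items():
            if traffic >= bytes_threshold:
                streak[region_id] += 1
            else:
                streak[region_id] = 0  # a quiet round resets the run
                hot.discard(region_id)
            if streak[region_id] >= hits_threshold:
                hot.add(region_id)
    return sorted(hot)


reports = [
    {1: 2_000_000, 2: 500_000, 3: 1_500_000},
    {1: 2_500_000, 2: 3_000_000, 3: 100_000},
    {1: 1_200_000, 2: 1_100_000, 3: 1_900_000},
]
hot_regions = find_hot_regions(reports)
```

In this toy input only Region 1 stays above the threshold for three consecutive rounds. Lowering `hits_threshold` makes the detection more sensitive to short bursts, at the cost of reacting to traffic that might not last.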
+ +For hot write regions, `hot-region-scheduler` attempts to redistribute both region peers and leaders; for hot read regions, `hot-region-scheduler` only redistributes region leaders. + +### Cluster topology awareness + +Cluster topology awareness enables PD to distribute replicas of a region as widely as possible. This is how TiKV ensures high availability and disaster recovery capability. PD continuously scans all regions in the background. When PD finds that the distribution of regions is not optimal, it generates an operator to replace peers and redistribute regions. + +The component to check region distribution is `replicaChecker`, which is similar to a scheduler except that it cannot be disabled. `replicaChecker` schedules based on the configuration of `location-labels`. For example, `[zone,rack,host]` defines a three-tier topology for a cluster. PD attempts to schedule region peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on. + +### Scale-down and failure recovery + +Scale-down refers to the process when you take a store offline and mark it as "offline" using a command. PD replicates the regions on the offline node to other nodes by scheduling. Failure recovery applies when stores fail and cannot be recovered. In this case, regions with peers distributed on the corresponding store might lose replicas, which requires PD to replenish the replicas on other nodes. + +The processes of scale-down and failure recovery are basically the same. `replicaChecker` finds a region peer in an abnormal state, and then generates an operator to replace the abnormal peer with a new one on a healthy store. + +### Region merge + +Region merge refers to the process of merging adjacent small regions. It serves to avoid unnecessary resource consumption by a large number of small or even empty regions after data deletion.
Region merge is performed by `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all regions in the background, and generates an operator when contiguous small regions are found. + +## Query scheduling status + +You can check the status of scheduling system through metrics, pd-ctl and logs. This section briefly introduces the methods of metrics and pd-ctl. Refer to [PD Monitoring Metrics](/v2.1/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control](/v2.1/reference/tools/pd-control.md) for details. + +### Operator status + +The **Grafana PD/Operator** page shows the metrics about operators, among which: + +- Schedule operator create: Operator creating information +- Operator finish duration: Execution time consumed by each operator +- Operator step duration: Execution time consumed by the operator step + +You can query operators using pd-ctl with the following commands: + +- `operator show`: Queries all operators generated in the current scheduling task +- `operator show [admin | leader | region]`: Queries operators by type + +### Balance status + +The **Grafana PD/Statistics - Balance** page shows the metrics about load balancing, among which: + +- Store leader/region score: Score of each store +- Store leader/region count: The number of leaders/regions in each store +- Store available: Available storage on each store + +You can use store commands of pd-ctl to query balance status of each store. 
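As a rough aid for reading that output, the following Python sketch computes the gap between the highest and lowest store score. The sample JSON and its field names (`stores`, `store.id`, `status.leader_score`, `status.region_score`) are assumptions about the shape of the `store` command output and should be checked against what your pd-ctl version actually prints.

```python
import json

# Hypothetical sample; field names are assumed, values are made up.
SAMPLE = """
{"count": 3, "stores": [
  {"store": {"id": 1}, "status": {"leader_score": 11000, "region_score": 28000}},
  {"store": {"id": 4}, "status": {"leader_score": 10500, "region_score": 27500}},
  {"store": {"id": 5}, "status": {"leader_score": 4000, "region_score": 27800}}
]}
"""


def score_spread(store_json, key):
    """Return (highest store ID, lowest store ID, score gap) for `key`."""
    stores = json.loads(store_json)["stores"]
    scores = {s["store"]["id"]: s["status"][key] for s in stores}
    high = max(scores, key=scores.get)
    low = min(scores, key=scores.get)
    return high, low, scores[high] - scores[low]


leader_gap = score_spread(SAMPLE, "leader_score")
region_gap = score_spread(SAMPLE, "region_score")
```

A large gap in one score but not the other suggests which scheduler (`balance-leader` or `balance-region`) still has work to do.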
+ +### Hot Region status + +The **Grafana PD/Statistics - hotspot** page shows the metrics about hot regions, among which: + +- Hot write region’s leader/peer distribution: the leader/peer distribution in hot write regions +- Hot read region’s leader distribution: the leader distribution in hot read regions + +You can also query the status of hot regions using pd-ctl with the following commands: + +- `hot read`: Queries hot read regions +- `hot write`: Queries hot write regions +- `hot store`: Queries the distribution of hot regions by store +- `region topread [limit]`: Queries the region with top read traffic +- `region topwrite [limit]`: Queries the region with top write traffic + +### Region health + +The **Grafana PD/Cluster/Region health** panel shows the metrics about regions in abnormal states. + +You can query the list of regions in abnormal states using pd-ctl with region check commands: + +- `region check miss-peer`: Queries regions without enough peers +- `region check extra-peer`: Queries regions with extra peers +- `region check down-peer`: Queries regions with down peers +- `region check pending-peer`: Queries regions with pending peers + +## Control scheduling strategy + +You can use pd-ctl to adjust the scheduling strategy from the following three aspects. Refer to [PD Control](/v2.1/reference/tools/pd-control.md) for more details. + +### Add/delete scheduler manually + +PD supports dynamically adding and removing schedulers directly through pd-ctl. For example: + +- `scheduler show`: Shows currently running schedulers in the system +- `scheduler remove balance-leader-scheduler`: Removes (disable) balance-leader-scheduler +- `scheduler add evict-leader-scheduler-1`: Adds a scheduler to remove all leaders in Store 1 + +### Add/delete Operators manually + +PD also supports adding or removing operators directly through pd-ctl. 
For example: + +- `operator add add-peer 2 5`: Adds a peer of Region 2 on Store 5 +- `operator add transfer-leader 2 5`: Migrates the leader of Region 2 to Store 5 +- `operator add split-region 2`: Splits Region 2 into two regions evenly in size +- `operator remove 2`: Removes the currently pending operator of Region 2 + +### Adjust scheduling parameter + +You can check the scheduling configuration using the `config show` command in pd-ctl, and adjust the values using `config set {key} {value}`. Common adjustments include: + +- `leader-schedule-limit`: Controls the concurrency of transferring leader scheduling +- `region-schedule-limit`: Controls the concurrency of adding/deleting peer scheduling +- `disable-replace-offline-replica`: Determines whether to disable the scheduling to take nodes offline +- `disable-location-replacement`: Determines whether to disable the scheduling that handles the isolation level of regions +- `max-snapshot-count`: Controls the maximum concurrency of sending/receiving snapshots for each store + +## PD scheduling in common scenarios + +This section illustrates the best practices of PD scheduling strategies through several typical scenarios. + +### Leaders/regions are not evenly distributed + +The rating mechanism of PD determines that leader count and region count of different stores cannot fully reflect the load balancing status. Therefore, it is necessary to confirm whether there is a load imbalance from the actual load of TiKV or storage usage. + +Once you have confirmed that leaders/regions are not evenly distributed, you need to check the rating of different stores. + +If the scores of different stores are close, it means PD mistakenly believes that leaders/regions are evenly distributed. Possible reasons are: + +- There are hot regions that cause a load imbalance. In this case, you need to analyze further based on [hot regions scheduling](#hot-regions-are-not-evenly-distributed).
+- There are a large number of empty regions or small regions, which leads to a great difference in the number of leaders in different stores and high pressure on Raft store. This is the time for a [region merge](#region-merge-is-slow) scheduling. +- Hardware and software environment varies among stores. You can adjust the values of `leader-weight` and `region-weight` accordingly to control the distribution of leader/region. +- Other unknown reasons. Still you can adjust the values of `leader-weight` and `region-weight` to control the distribution of leader/region. + +If there is a big difference in the rating of different stores, you need to examine the operator-related metrics, with special focus on the generation and execution of operators. There are two main situations: + +- When operators are generated normally but the scheduling process is slow, it is possible that: + + - The scheduling speed is limited by default for load balancing purpose. You can adjust `leader-schedule-limit` or `region-schedule-limit` to larger values without significantly impacting regular services. In addition, you can also properly ease the restrictions specified by `max-pending-peer-count` and `max-snapshot-count`. + - Other scheduling tasks are running concurrently, which slows down the balancing. In this case, if the balancing takes precedence over other scheduling tasks, you can stop other tasks or limit their speeds. For example, if you take some nodes offline when balancing is in progress, both operations consume the quota of `region-schedule-limit`. In this case, you can limit the speed of scheduler to remove nodes, or simply set `disable-replace-offline-replica = true` to temporarily disable it. + - The scheduling process is too slow. You can check the **Operator step duration** metric to confirm the cause. 
Generally, steps that do not involve sending and receiving snapshots (such as `TransferLeader`, `RemovePeer`, `PromoteLearner`) should be completed in milliseconds, while steps that involve snapshots (such as `AddLearner` and `AddPeer`) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or a network bottleneck, which needs specific analysis. + +- PD fails to generate the corresponding balancing operators. Possible reasons include: + + - The scheduler is not activated. For example, the corresponding scheduler is deleted, or its limit is set to "0". + - Other constraints. For example, `evict-leader-scheduler` in the system prevents leaders from being migrated to the corresponding store. Alternatively, a label property is set, which makes some stores reject leaders. + - Restrictions from the cluster topology. For example, in a cluster of 3 replicas across 3 data centers, 3 replicas of each region are distributed in different data centers due to replica isolation. If the number of stores is different among these data centers, the scheduling can only reach a balanced state within each data center, but not balanced globally. + +### Taking nodes offline is slow + +This scenario requires examining the generation and execution of operators through related metrics. + +If operators are successfully generated but the scheduling process is slow, possible reasons are: + +- The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to larger values. Similarly, you can consider loosening the limits on `max-pending-peer-count` and `max-snapshot-count`. +- Other scheduling tasks are running concurrently and racing for resources in the system. You can refer to the solution in [Leaders/regions are not evenly distributed](#leadersregions-are-not-evenly-distributed).
+- When you take a single node offline, a number of region leaders to be processed (around 1/3 under the configuration of 3 replicas) are distributed on the node to remove. Therefore, the speed is limited by the speed at which snapshots are generated by this single node. You can speed it up by manually adding an `evict-leader-scheduler` to migrate leaders. + +If the corresponding operator fails to generate, possible reasons are: + +- The operator is stopped, or `replica-schedule-limit` is set to "0". +- There is no proper node for region migration. For example, if the available capacity size of the replacing node (of the same label) is less than 20%, PD will stop scheduling to avoid running out of storage on that node. In such case, you need to add more nodes or delete some data to free the space. + +### Bringing nodes online is slow + +Currently, bringing nodes online is scheduled through the balance region mechanism. You can refer to [Leaders/regions are not evenly distributed](#leadersregions-are-not-evenly-distributed) for troubleshooting. + +### Hot regions are not evenly distributed + +Hot regions scheduling issues generally fall into the following categories: + +- Hot regions can be observed via PD metrics, but the scheduling speed cannot keep up to redistribute hot regions in time. + + **Solution**: adjust `hot-region-schedule-limit` to a larger value, and reduce the limit quota of other schedulers to speed up hot regions scheduling. Or you can adjust `hot-region-cache-hits-threshold` to a smaller value to make PD more sensitive to traffic changes. + +- Hotspot formed on a single region. For example, a small table is intensively scanned by a massive amount of requests. This can also be detected from PD metrics. Because you cannot actually distribute a single hotspot, you need to manually add a `split-region` operator to split such a region. 
+ +- The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, when there are intensive point lookup requests for some regions, it might not be obvious to detect in traffic, but still the high QPS might lead to bottlenecks in key modules. + + **Solutions**: Firstly, locate the table where hot regions are formed based on the specific business. Then add a `scatter-range-scheduler` scheduler to make all regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. + +### Region merge is slow + +Similar to slow scheduling, the speed of region merge is most likely limited by the configurations of `merge-schedule-limit` and `region-schedule-limit`, or the region merge scheduler is competing with other schedulers. Specifically, the solutions are: + +- If it is known from metrics that there are a large number of empty regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to smaller values to speed up the merge. This is because the merge process involves replica migration, so the smaller the region to be merged, the faster the merge is. If the merge operators are already generated rapidly, to further speed up the process, you can set `patrol-region-interval` to `10ms`. This makes region scanning faster at the cost of more CPU consumption. + +- A lot of tables have been created and then emptied (including truncated tables). These empty regions cannot be merged if the split table attribute is enabled. 
You can disable this attribute by adjusting the following parameters: + + - TiKV: set `split-region-on-table` to `false` + - PD: set `namespace-classifier` to "" + +For v3.0.4 and v2.1.16 or earlier, the `approximate_keys` of regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the number of keys break the constraints of `max-merge-region-keys`. To avoid this problem, you can adjust `max-merge-region-keys` to a larger value. + +### Troubleshoot TiKV node + +If a TiKV node fails, PD defaults to setting the corresponding node to the **down** state after 30 minutes (customizable by configuration item `max-store-down-time`), and rebalancing replicas for regions involved. + +Practically, if a node failure is considered unrecoverable, you can immediately take it offline. This makes PD replenish replicas soon in another node and reduces the risk of data loss. In contrast, if a node is considered recoverable, but the recovery cannot be done in 30 minutes, you can temporarily adjust `max-store-down-time` to a larger value to avoid unnecessary replenishment of the replicas and resources waste after the timeout. \ No newline at end of file diff --git a/v3.0/glossary.md b/v3.0/glossary.md new file mode 100644 index 0000000000000..dd1d60b829501 --- /dev/null +++ b/v3.0/glossary.md @@ -0,0 +1,67 @@ +--- +title: Glossary +summary: Glossaries about TiDB. +category: glossary +--- + +# Glossary + +## L + +### leader/follower/learner + +Leader/Follower/Learner each corresponds to a role in a Raft group of [peers](#regionpeerraft-group). The leader services all client requests and replicates data to the followers. If the group leader fails, one of the followers will be elected as the new leader. Learners are non-voting followers that only serves in the process of replica addition. + +## O + +### Operator + +An operator is a collection of actions that applies to a region for scheduling purposes. 
Operators perform scheduling tasks such as "migrate the leader of Region 2 to Store 5" and "migrate replicas of Region 2 to Store 1, 4, 5". + +An operator can be computed and generated by a [scheduler](#scheduler), or created by an external API. + +### Operator step + +An operator step is a step in the execution of an operator. An operator normally contains multiple Operator steps. + +Currently, available steps generated by PD include: + +- `TransferLeader`: Transfers leadership to a specified member +- `AddPeer`: Adds peers to a specified store +- `RemovePeer`: Removes a peer of a region +- `AddLearner`: Adds learners to a specified store +- `PromoteLearner`: Promotes a specified learner to a voting member +- `SplitRegion`: Splits a specified region into two + +## P + +### pending/down + +"Pending" and "down" are two special states of a peer. Pending indicates that the Raft log of followers or learners is vastly different from that of leader. Followers in pending cannot be elected as leader. "Down" refers to a state that a peer ceases to respond to leader for a long time, which usually means the corresponding node is down or isolated from the network. + +## R + +### region/peer/Raft group + +Region is the minimal piece of data storage in TiKV, each representing a range of data (96 MiB by default). Each region has three replicas by default. A replica of a region is called a peer. Multiple peers of the same region replicate data via the Raft consensus algorithm, so peers are also members of a Raft instance. TiKV uses Multi-Raft to manage data. That is, for each region, there is a corresponding, isolated Raft group. + +### region split + +Regions are generated as data writes increase. The process of splitting is called region split. + +The mechanism of region split is to use one initial region to cover the entire key space, and generate new regions through splitting existing ones every time the size of the region or the number of keys has reached a threshold. 
+ +## S + +### scheduler + +Schedulers are components in PD that generate scheduling tasks. Each scheduler in PD runs independently and serves different purposes. The commonly used schedulers are: + +- `balance-leader-scheduler`: Balances the distribution of leaders +- `balance-region-scheduler`: Balances the distribution of peers +- `hot-region-scheduler`: Balances the distribution of hot regions +- `evict-leader-{store-id}`: Evicts all leaders of a node (often used for rolling upgrades) + +### Store + +A store refers to the storage node in the TiKV cluster (an instance of `tikv-server`). Each store has a corresponding TiKV instance. \ No newline at end of file diff --git a/v3.0/reference/best-practices/pd-scheduling.md b/v3.0/reference/best-practices/pd-scheduling.md new file mode 100644 index 0000000000000..d823eb3b25ae2 --- /dev/null +++ b/v3.0/reference/best-practices/pd-scheduling.md @@ -0,0 +1,270 @@ +--- +title: PD Scheduling +summary: Learn best practice and strategy for PD scheduling. +category: reference +--- + +# PD Scheduling + +This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: + +- [leader/follower/learner](/v3.0/glossary.md#leaderfollowerlearner) +- [operator](/v3.0/glossary.md#operator) +- [operator step](/v3.0/glossary.md#operator-step) +- [pending/down](/v3.0/glossary.md#pendingdown) +- [region/peer/Raft group](/v3.0/glossary.md#regionpeerraft-group) +- [region split](/v3.0/glossary.md#region-split) +- [scheduler](/v3.0/glossary.md#scheduler) +- [store](/v3.0/glossary.md#store) + +> **Note:** +> +> This document initially targets TiDB 3.0. Although some features are not supported in earlier versions (2.x), the underlying mechanisms are similar and this document can still be used as a reference. 
+ +## PD scheduling policies + +This section introduces the principles and processes involved in the scheduling system. + +### Scheduling process + +The scheduling process generally has three steps: + +1. Collect information + + Each TiKV node periodically reports two types of heartbeats to PD: + + - `StoreHeartbeat`: Contains the overall information of stores, including disk capacity, available storage, and read/write traffic + - `RegionHeartbeat`: Contains the overall information of regions, including the range of each region, peer distribution, peer status, data volume, and read/write traffic + + PD collects and stores this information for scheduling decisions. + +2. Generate operators + + Different schedulers generate operators based on their own logic and requirements, with the following considerations: + + - Do not add peers to a store in an abnormal state (disconnected, down, busy, low space) + - Do not balance regions in abnormal states + - Do not transfer a leader to a pending peer + - Do not remove a leader directly + - Do not break the physical isolation of various region peers + - Do not violate constraints such as label property + +3. Execute operators + + To execute the operators, the general procedure is: + + 1. The generated operator first joins a queue managed by `OperatorController`. + + 2. `OperatorController` takes the operator out of the queue and executes it with a certain amount of concurrency based on the configuration. This step assigns each operator step to the corresponding region leader. + + 3. The operator is marked as "finish" or "timeout" and removed from the queue. + +### Load balancing + +Load balancing of regions primarily relies on the `balance-leader` and `balance-region` schedulers.
Both schedulers aim to distribute regions evenly across all stores in the cluster, but with separate focuses: `balance-leader` deals with region leaders to balance incoming client requests, whereas `balance-region` deals with region peers to spread storage pressure and avoid problems such as running out of storage space. + +`balance-leader` and `balance-region` share a similar scheduling process: + +1. Rate stores according to their resource availability. +2. `balance-leader` or `balance-region` constantly transfers leaders or peers from stores with high scores to those with low scores. + +However, their rating methods are different. `balance-leader` uses the sum of all region sizes corresponding to leaders in a store, whereas the rating method of `balance-region` is more complicated. Depending on the specific storage capacity of each node, the rating of `balance-region` might be: + +- Based on the amount of data when there is sufficient storage (to balance data distribution among nodes). +- Based on the available storage when there is insufficient storage (to balance the storage availability on different nodes). +- Based on the weighted sum of the two factors above when neither of the situations applies. + +Because different nodes might differ in performance, you can also set the weight of load balancing for different stores. `leader-weight` and `region-weight` are used to control the leader weight and region weight respectively ("1" by default for both). For example, when the `leader-weight` of a store is set to "2", the number of leaders on the node is about twice that of other nodes after the scheduling stabilizes. Similarly, when the `leader-weight` of a store is set to "0.5", the number of leaders on the node is about half that of other nodes. + +### Hot regions scheduling + +For hot regions scheduling, use `hot-region-scheduler`. Currently in TiDB 3.0, the process is performed as follows: + +1.
Count hot regions by determining read/write traffic that exceeds a certain threshold for a certain period based on the information reported by stores. + +2. Redistribute these regions in a similar way to load balancing. + +For hot write regions, `hot-region-scheduler` attempts to redistribute both region peers and leaders; for hot read regions, `hot-region-scheduler` only redistributes region leaders. + +### Cluster topology awareness + +Cluster topology awareness enables PD to distribute replicas of a region as widely as possible. This is how TiKV ensures high availability and disaster recovery capability. PD continuously scans all regions in the background. When PD finds that the distribution of regions is not optimal, it generates an operator to replace peers and redistribute regions. + +The component to check region distribution is `replicaChecker`, which is similar to a scheduler except that it cannot be disabled. `replicaChecker` schedules based on the configuration of `location-labels`. For example, `[zone,rack,host]` defines a three-tier topology for a cluster. PD attempts to schedule region peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on. + +### Scale-down and failure recovery + +Scale-down refers to the process when you take a store offline and mark it as "offline" using a command. PD replicates the regions on the offline node to other nodes by scheduling. Failure recovery applies when stores fail and cannot be recovered. In this case, regions with peers distributed on the corresponding store might lose replicas, which requires PD to replenish the replicas on other nodes. + +The processes of scale-down and failure recovery are basically the same. `replicaChecker` finds a region peer in an abnormal state, and then generates an operator to replace the abnormal peer with a new one on a healthy store.
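To make the replacement step more concrete, here is a simplified Python sketch of choosing a healthy store that is as isolated as possible from the remaining peers under `location-labels`. It is a stand-in for illustration, not the actual `replicaChecker` algorithm; the store dictionaries, the `healthy` flag, and the scoring rule are all assumptions.

```python
LOCATION_LABELS = ["zone", "rack", "host"]  # highest tier first


def differing_tier(a, b, labels):
    """Score how far apart two label sets are: a difference at a higher
    tier (for example, zone) scores more than one at a lower tier."""
    for depth, label in enumerate(labels):
        if a.get(label) != b.get(label):
            return len(labels) - depth
    return 0  # identical location


def pick_replacement(candidates, peers, labels=LOCATION_LABELS):
    """Return the ID of the healthy candidate best isolated from `peers`,
    or None when no healthy candidate exists."""
    def score(store):
        tiers = [differing_tier(store["labels"], p["labels"], labels)
                 for p in peers]
        # Prefer the best worst-case isolation, then the best total.
        return (min(tiers, default=0), sum(tiers))

    healthy = [s for s in candidates if s["healthy"]]
    return max(healthy, key=score)["id"] if healthy else None


peers = [
    {"id": 1, "labels": {"zone": "z1", "rack": "r1", "host": "h1"}},
    {"id": 2, "labels": {"zone": "z2", "rack": "r1", "host": "h1"}},
]
candidates = [
    {"id": 3, "healthy": True, "labels": {"zone": "z1", "rack": "r2", "host": "h1"}},
    {"id": 4, "healthy": True, "labels": {"zone": "z3", "rack": "r1", "host": "h1"}},
    {"id": 5, "healthy": False, "labels": {"zone": "z3", "rack": "r2", "host": "h2"}},
]
chosen = pick_replacement(candidates, peers)
```

Here Store 4 is chosen because it sits in a zone that neither remaining peer uses, while Store 5 is skipped for being unhealthy even though its location is equally distinct.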
+ +### Region merge + +Region merge refers to the process of merging adjacent small regions. It serves to avoid unnecessary resource consumption by a large number of small or even empty regions after data deletion. Region merge is performed by `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all regions in the background, and generates an operator when contiguous small regions are found. + +## Query scheduling status + +You can check the status of scheduling system through metrics, pd-ctl and logs. This section briefly introduces the methods of metrics and pd-ctl. Refer to [PD Monitoring Metrics](/v3.0/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control](/v3.0/reference/tools/pd-control.md) for details. + +### Operator status + +The **Grafana PD/Operator** page shows the metrics about operators, among which: + +- Schedule operator create: Operator creating information +- Operator finish duration: Execution time consumed by each operator +- Operator step duration: Execution time consumed by the operator step + +You can query operators using pd-ctl with the following commands: + +- `operator show`: Queries all operators generated in the current scheduling task +- `operator show [admin | leader | region]`: Queries operators by type + +### Balance status + +The **Grafana PD/Statistics - Balance** page shows the metrics about load balancing, among which: + +- Store leader/region score: Score of each store +- Store leader/region count: The number of leaders/regions in each store +- Store available: Available storage on each store + +You can use store commands of pd-ctl to query balance status of each store. 
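When reading leader scores and counts side by side, it can help to see how store weights shape the balanced state. The toy simulation below assumes every leader has the same size, so the score is simply `leader count / leader-weight`; the function name and the stop rule are illustrative assumptions, not PD's scheduler code.

```python
def balance_leaders(leaders, weights):
    """Move one leader at a time from the highest-scored store to the
    lowest-scored store until a move would no longer narrow the gap.

    Score is modeled as leader count / leader-weight (equal-sized leaders).
    """
    leaders = dict(leaders)  # do not mutate the caller's dict

    def score(store):
        return leaders[store] / weights[store]

    while True:
        src = max(leaders, key=score)
        dst = min(leaders, key=score)
        # Stop when the source would drop below the target after the move.
        if (leaders[src] - 1) / weights[src] < (leaders[dst] + 1) / weights[dst]:
            return leaders
        leaders[src] -= 1
        leaders[dst] += 1


# Store 1 has leader-weight 2; stores 2 and 3 keep the default of 1.
result = balance_leaders({1: 400, 2: 0, 3: 0}, {1: 2, 2: 1, 3: 1})
```

In this run Store 1 settles at 200 leaders and the other two at 100 each, which matches the "about twice as many" behavior described for a `leader-weight` of "2". The scores of all three stores are equal at that point even though the counts are not, which is why counts alone do not show whether the cluster is balanced.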
+ +### Hot Region status + +The **Grafana PD/Statistics - hotspot** page shows the metrics about hot regions, among which: + +- Hot write region’s leader/peer distribution: the leader/peer distribution in hot write regions +- Hot read region’s leader distribution: the leader distribution in hot read regions + +You can also query the status of hot regions using pd-ctl with the following commands: + +- `hot read`: Queries hot read regions +- `hot write`: Queries hot write regions +- `hot store`: Queries the distribution of hot regions by store +- `region topread [limit]`: Queries the region with top read traffic +- `region topwrite [limit]`: Queries the region with top write traffic + +### Region health + +The **Grafana PD/Cluster/Region health** panel shows the metrics about regions in abnormal states. + +You can query the list of regions in abnormal states using pd-ctl with region check commands: + +- `region check miss-peer`: Queries regions without enough peers +- `region check extra-peer`: Queries regions with extra peers +- `region check down-peer`: Queries regions with down peers +- `region check pending-peer`: Queries regions with pending peers + +## Control scheduling strategy + +You can use pd-ctl to adjust the scheduling strategy from the following three aspects. Refer to [PD Control](/v3.0/reference/tools/pd-control.md) for more details. + +### Add/delete scheduler manually + +PD supports dynamically adding and removing schedulers directly through pd-ctl. For example: + +- `scheduler show`: Shows currently running schedulers in the system +- `scheduler remove balance-leader-scheduler`: Removes (disable) balance-leader-scheduler +- `scheduler add evict-leader-scheduler-1`: Adds a scheduler to remove all leaders in Store 1 + +### Add/delete Operators manually + +PD also supports adding or removing operators directly through pd-ctl. 
For example: + +- `operator add add-peer 2 5`: Adds a peer of Region 2 on Store 5 +- `operator add transfer-leader 2 5`: Migrates the leader of Region 2 to Store 5 +- `operator add split-region 2`: Splits Region 2 into two regions evenly in size +- `operator remove 2`: Removes the currently pending operator on Region 2 + +### Adjust scheduling parameter + +You can check the scheduling configuration using the `config show` command in pd-ctl, and adjust the values using `config set {key} {value}`. Common adjustments include: + +- `leader-schedule-limit`: Controls the concurrency of leader transfer scheduling +- `region-schedule-limit`: Controls the concurrency of adding/deleting peer scheduling +- `disable-replace-offline-replica`: Determines whether to disable the scheduling to take nodes offline +- `disable-location-replacement`: Determines whether to disable the scheduling that handles the isolation level of regions +- `max-snapshot-count`: Controls the maximum concurrency of sending/receiving snapshots for each store + +## PD scheduling in common scenarios + +This section illustrates the best practices of PD scheduling strategies through several typical scenarios. + +### Leaders/regions are not evenly distributed + +Because of the rating mechanism of PD, the leader count and region count of different stores cannot fully reflect the load balancing status. Therefore, it is necessary to confirm whether there is a load imbalance from the actual load of TiKV or the storage usage. + +Once you have confirmed that leaders/regions are not evenly distributed, you need to check the rating of different stores. + +If the scores of different stores are close, it means PD mistakenly believes that leaders/regions are evenly distributed. Possible reasons are: + +- There are hot regions that cause a load imbalance. In this case, you need to analyze further based on [hot regions scheduling](#hot-regions-are-not-evenly-distributed).
+- There are a large number of empty regions or small regions, which leads to a great difference in the number of leaders in different stores and high pressure on Raft store. This is the time for a [region merge](#region-merge-is-slow) scheduling. +- Hardware and software environment varies among stores. You can adjust the values of `leader-weight` and `region-weight` accordingly to control the distribution of leader/region. +- Other unknown reasons. Still you can adjust the values of `leader-weight` and `region-weight` to control the distribution of leader/region. + +If there is a big difference in the rating of different stores, you need to examine the operator-related metrics, with special focus on the generation and execution of operators. There are two main situations: + +- When operators are generated normally but the scheduling process is slow, it is possible that: + + - The scheduling speed is limited by default for load balancing purpose. You can adjust `leader-schedule-limit` or `region-schedule-limit` to larger values without significantly impacting regular services. In addition, you can also properly ease the restrictions specified by `max-pending-peer-count` and `max-snapshot-count`. + - Other scheduling tasks are running concurrently, which slows down the balancing. In this case, if the balancing takes precedence over other scheduling tasks, you can stop other tasks or limit their speeds. For example, if you take some nodes offline when balancing is in progress, both operations consume the quota of `region-schedule-limit`. In this case, you can limit the speed of scheduler to remove nodes, or simply set `disable-replace-offline-replica = true` to temporarily disable it. + - The scheduling process is too slow. You can check the **Operator step duration** metric to confirm the cause. 
Generally, steps that do not involve sending and receiving snapshots (such as `TransferLeader`, `RemovePeer`, `PromoteLearner`) should be completed in milliseconds, while steps that involve snapshots (such as `AddLearner` and `AddPeer`) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or a network bottleneck, which needs case-by-case analysis. + +- PD fails to generate the corresponding balancing scheduler. Possible reasons include: + + - The scheduler is not activated. For example, the corresponding scheduler is deleted, or its limit is set to "0". + - Other constraints. For example, `evict-leader-scheduler` in the system prevents leaders from being migrated to the corresponding store. Or a label property is set, which makes some stores reject leaders. + - Restrictions from the cluster topology. For example, in a cluster of 3 replicas across 3 data centers, 3 replicas of each region are distributed in different data centers due to replica isolation. If the number of stores is different among these data centers, the scheduling can only reach a balanced state within each data center, but not balanced globally. + +### Taking nodes offline is slow + +This scenario requires examining the generation and execution of operators through related metrics. + +If operators are successfully generated but the scheduling process is slow, possible reasons are: + +- The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to larger values. Similarly, you can consider loosening the limits on `max-pending-peer-count` and `max-snapshot-count`. +- Other scheduling tasks are running concurrently and racing for resources in the system. You can refer to the solution in [Leaders/regions are not evenly distributed](#leadersregions-are-not-evenly-distributed).
+- When you take a single node offline, a number of region leaders to be processed (around 1/3 under the configuration of 3 replicas) are distributed on the node to remove. Therefore, the speed is limited by the speed at which snapshots are generated by this single node. You can speed it up by manually adding an `evict-leader-scheduler` to migrate leaders. + +If the corresponding operator fails to generate, possible reasons are: + +- The operator is stopped, or `replica-schedule-limit` is set to "0". +- There is no proper node for region migration. For example, if the available capacity size of the replacing node (of the same label) is less than 20%, PD will stop scheduling to avoid running out of storage on that node. In such case, you need to add more nodes or delete some data to free the space. + +### Bringing nodes online is slow + +Currently, bringing nodes online is scheduled through the balance region mechanism. You can refer to [Leaders/regions are not evenly distributed](#leadersregions-are-not-evenly-distributed) for troubleshooting. + +### Hot regions are not evenly distributed + +Hot regions scheduling issues generally fall into the following categories: + +- Hot regions can be observed via PD metrics, but the scheduling speed cannot keep up to redistribute hot regions in time. + + **Solution**: adjust `hot-region-schedule-limit` to a larger value, and reduce the limit quota of other schedulers to speed up hot regions scheduling. Or you can adjust `hot-region-cache-hits-threshold` to a smaller value to make PD more sensitive to traffic changes. + +- Hotspot formed on a single region. For example, a small table is intensively scanned by a massive amount of requests. This can also be detected from PD metrics. Because you cannot actually distribute a single hotspot, you need to manually add a `split-region` operator to split such a region. 
+ +- The load of some nodes is significantly higher than that of other nodes from TiKV-related metrics, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, when there are intensive point lookup requests for some regions, it might not be obvious to detect in traffic, but still the high QPS might lead to bottlenecks in key modules. + + **Solutions**: Firstly, locate the table where hot regions are formed based on the specific business. Then add a `scatter-range-scheduler` scheduler to make all regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. + +### Region merge is slow + +Similar to slow scheduling, the speed of region merge is most likely limited by the configurations of `merge-schedule-limit` and `region-schedule-limit`, or the region merge scheduler is competing with other schedulers. Specifically, the solutions are: + +- If it is known from metrics that there are a large number of empty regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to smaller values to speed up the merge. This is because the merge process involves replica migration, so the smaller the region to be merged, the faster the merge is. If the merge operators are already generated rapidly, to further speed up the process, you can set `patrol-region-interval` to `10ms`. This makes region scanning faster at the cost of more CPU consumption. + +- A lot of tables have been created and then emptied (including truncated tables). These empty regions cannot be merged if the split table attribute is enabled. 
You can disable this attribute by adjusting the following parameters: + + - TiKV: set `split-region-on-table` to `false` + - PD: set `namespace-classifier` to "" + +For v3.0.4 and v2.1.16 or earlier, the `approximate_keys` of regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the number of keys break the constraints of `max-merge-region-keys`. To avoid this problem, you can adjust `max-merge-region-keys` to a larger value. + +### Troubleshoot TiKV node + +If a TiKV node fails, by default PD sets the corresponding node to the **down** state after 30 minutes (customizable through the `max-store-down-time` configuration item) and rebalances replicas for the regions involved. + +Practically, if a node failure is considered unrecoverable, you can immediately take it offline. This makes PD replenish the replicas on other nodes sooner and reduces the risk of data loss. In contrast, if a node is considered recoverable, but the recovery cannot be done in 30 minutes, you can temporarily adjust `max-store-down-time` to a larger value to avoid unnecessary replenishment of the replicas and wasted resources after the timeout. \ No newline at end of file diff --git a/v3.1/glossary.md b/v3.1/glossary.md new file mode 100644 index 0000000000000..dd1d60b829501 --- /dev/null +++ b/v3.1/glossary.md @@ -0,0 +1,67 @@ +--- +title: Glossary +summary: Glossaries about TiDB. +category: glossary +--- + +# Glossary + +## L + +### leader/follower/learner + +Leader/Follower/Learner each corresponds to a role in a Raft group of [peers](#regionpeerraft-group). The leader services all client requests and replicates data to the followers. If the group leader fails, one of the followers will be elected as the new leader. Learners are non-voting followers that only serve in the process of replica addition. + +## O + +### Operator + +An operator is a collection of actions that applies to a region for scheduling purposes.
Operators perform scheduling tasks such as "migrate the leader of Region 2 to Store 5" and "migrate replicas of Region 2 to Store 1, 4, 5". + +An operator can be computed and generated by a [scheduler](#scheduler), or created by an external API. + +### Operator step + +An operator step is a step in the execution of an operator. An operator normally contains multiple Operator steps. + +Currently, available steps generated by PD include: + +- `TransferLeader`: Transfers leadership to a specified member +- `AddPeer`: Adds peers to a specified store +- `RemovePeer`: Removes a peer of a region +- `AddLearner`: Adds learners to a specified store +- `PromoteLearner`: Promotes a specified learner to a voting member +- `SplitRegion`: Splits a specified region into two + +## P + +### pending/down + +"Pending" and "down" are two special states of a peer. Pending indicates that the Raft log of followers or learners is vastly different from that of leader. Followers in pending cannot be elected as leader. "Down" refers to a state that a peer ceases to respond to leader for a long time, which usually means the corresponding node is down or isolated from the network. + +## R + +### region/peer/Raft group + +Region is the minimal piece of data storage in TiKV, each representing a range of data (96 MiB by default). Each region has three replicas by default. A replica of a region is called a peer. Multiple peers of the same region replicate data via the Raft consensus algorithm, so peers are also members of a Raft instance. TiKV uses Multi-Raft to manage data. That is, for each region, there is a corresponding, isolated Raft group. + +### region split + +Regions are generated as data writes increase. The process of splitting is called region split. + +The mechanism of region split is to use one initial region to cover the entire key space, and generate new regions through splitting existing ones every time the size of the region or the number of keys has reached a threshold. 
+ +## S + +### scheduler + +Schedulers are components in PD that generate scheduling tasks. Each scheduler in PD runs independently and serves different purposes. The commonly used schedulers are: + +- `balance-leader-scheduler`: Balances the distribution of leaders +- `balance-region-scheduler`: Balances the distribution of peers +- `hot-region-scheduler`: Balances the distribution of hot regions +- `evict-leader-{store-id}`: Evicts all leaders of a node (often used for rolling upgrades) + +### Store + +A store refers to the storage node in the TiKV cluster (an instance of `tikv-server`). Each store has a corresponding TiKV instance. \ No newline at end of file diff --git a/v3.1/reference/best-practices/pd-scheduling.md b/v3.1/reference/best-practices/pd-scheduling.md new file mode 100644 index 0000000000000..e8beb64e88943 --- /dev/null +++ b/v3.1/reference/best-practices/pd-scheduling.md @@ -0,0 +1,270 @@ +--- +title: PD Scheduling +summary: Learn best practice and strategy for PD scheduling. +category: reference +--- + +# PD Scheduling + +This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: + +- [leader/follower/learner](/v3.1/glossary.md#leaderfollowerlearner) +- [operator](/v3.1/glossary.md#operator) +- [operator step](/v3.1/glossary.md#operator-step) +- [pending/down](/v3.1/glossary.md#pendingdown) +- [region/peer/Raft group](/v3.1/glossary.md#regionpeerraft-group) +- [region split](/v3.1/glossary.md#region-split) +- [scheduler](/v3.1/glossary.md#scheduler) +- [store](/v3.1/glossary.md#store) + +> **Note:** +> +> This document initially targets TiDB 3.0. Although some features are not supported in earlier versions (2.x), the underlying mechanisms are similar and this document can still be used as a reference. 
+ +## PD scheduling policies + +This section introduces the principles and processes involved in the scheduling system. + +### Scheduling process + +The scheduling process generally has three steps: + +1. Collect information + + Each TiKV node periodically reports two types of heartbeats to PD: + + - `StoreHeartbeat`: Contains the overall information of stores, including disk capacity, available storage, and read/write traffic + - `RegionHeartbeat`: Contains the overall information of regions, including the range of each region, peer distribution, peer status, data volume, and read/write traffic + + PD collects and stores this information for scheduling decisions. + +2. Generate operators + + Different schedulers generate operators based on their own logic and requirements, with the following considerations: + + - Do not add peers to a store in an abnormal state (disconnected, down, busy, low space) + - Do not balance regions in an abnormal state + - Do not transfer a leader to a pending peer + - Do not remove a leader directly + - Do not break the physical isolation of various region peers + - Do not violate constraints such as label property + +3. Execute operators + + To execute the operators, the general procedure is: + + 1. The generated operator first joins a queue managed by `OperatorController`. + + 2. `OperatorController` takes the operator out of the queue and executes it with a certain amount of concurrency based on the configuration. In this step, each operator step is assigned to the corresponding region leader. + + 3. The operator is marked as "finish" or "timeout" and removed from the queue. + +### Load balancing + +Regions primarily rely on the `balance-leader` and `balance-region` schedulers to achieve load balance.
Both schedulers aim to distribute regions evenly across all stores in the cluster, but with different focuses: `balance-leader` deals with region leaders to balance incoming client requests, whereas `balance-region` deals with region peers to spread the storage pressure and avoid exceptions such as running out of storage space. + +`balance-leader` and `balance-region` share a similar scheduling process: + +1. Rate stores according to their resource availability. +2. `balance-leader` or `balance-region` constantly transfers leaders or peers from stores with high scores to those with low scores. + +However, their rating methods are different. `balance-leader` uses the sum of all region sizes corresponding to leaders in a store, whereas the rating method of `balance-region` is more complicated. Depending on the specific storage capacity of each node, the rating method of `balance-region` might be: + +- Based on the amount of data when there is sufficient storage (to balance data distribution among nodes). +- Based on the available storage when there is insufficient storage (to balance the storage availability on different nodes). +- Based on the weighted sum of the two factors above when neither of the situations applies. + +Because different nodes might differ in performance, you can also set the weight of load balancing for different stores. `leader-weight` and `region-weight` are used to control the leader weight and region weight respectively ("1" by default for both). For example, when the `leader-weight` of a store is set to "2", the number of leaders on the node is about twice that of other nodes after the scheduling stabilizes. Similarly, when the `leader-weight` of a store is set to "0.5", the number of leaders on the node is about half that of other nodes. + +### Hot regions scheduling + +For hot regions scheduling, use `hot-region-scheduler`. Currently in TiDB 3.0, the process is performed as follows: + +1.
Count hot regions by identifying those whose read/write traffic exceeds a certain threshold for a certain period, based on the information reported by stores. + +2. Redistribute these regions in a similar way to load balancing. + +For hot write regions, `hot-region-scheduler` attempts to redistribute both region peers and leaders; for hot read regions, `hot-region-scheduler` only redistributes region leaders. + +### Cluster topology awareness + +Cluster topology awareness enables PD to distribute the replicas of a region as widely as possible. This is how TiKV ensures high availability and disaster recovery capability. PD continuously scans all regions in the background. When PD finds that the distribution of regions is not optimal, it generates an operator to replace peers and redistribute regions. + +The component that checks region distribution is `replicaChecker`, which is similar to a scheduler except that it cannot be disabled. `replicaChecker` schedules based on the configuration of `location-labels`. For example, `[zone,rack,host]` defines a three-tier topology for a cluster. PD attempts to schedule region peers to different zones first, or to different racks when zones are insufficient (for example, 2 zones for 3 replicas), or to different hosts when racks are insufficient, and so on. + +### Scale-down and failure recovery + +Scale-down refers to the process when you take a store offline and mark it as "offline" using a command. PD replicates the regions on the offline node to other nodes by scheduling. Failure recovery applies when stores fail and cannot be recovered. In this case, regions with peers distributed on the corresponding store might lose replicas, which requires PD to replenish them on other nodes. + +The processes of scale-down and failure recovery are basically the same. `replicaChecker` finds a region peer in an abnormal state, and then generates an operator to replace the abnormal peer with a new one on a healthy store.
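
The replacement step described above can be sketched roughly as follows. This is a simplified illustration and not the actual `replicaChecker` code in PD: the `Store` class, the `isolation` score, and the `pick_replacement` function are hypothetical names, and the sketch assumes that the new store is chosen among healthy stores so that it differs from the remaining peers at the highest possible level of `location-labels`.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical view of a store: ID, health flag, and its location labels."""
    id: int
    healthy: bool
    labels: dict = field(default_factory=dict)  # for example {"zone": "z1", "host": "h1"}


def isolation(a, b, location_labels):
    """Higher is better: how high in the label hierarchy stores a and b first differ."""
    for depth, key in enumerate(location_labels):
        if a.labels.get(key) != b.labels.get(key):
            return len(location_labels) - depth
    return 0  # identical at every level


def pick_replacement(candidates, remaining_peers, location_labels):
    """Pick a healthy store that holds no peer yet, maximizing the worst-case isolation."""
    used = {s.id for s in remaining_peers}
    best, best_score = None, -1
    for store in candidates:
        if not store.healthy or store.id in used:
            continue
        score = min(
            (isolation(store, peer, location_labels) for peer in remaining_peers),
            default=len(location_labels),
        )
        if score > best_score:
            best, best_score = store, score
    return best


# Illustrative topology: the peer on store 3 is abnormal and must be replaced.
location_labels = ["zone", "rack", "host"]
s1 = Store(1, True, {"zone": "z1", "rack": "r1", "host": "h1"})
s2 = Store(2, True, {"zone": "z2", "rack": "r1", "host": "h2"})
s3 = Store(3, False, {"zone": "z3", "rack": "r1", "host": "h3"})
s4 = Store(4, True, {"zone": "z1", "rack": "r2", "host": "h4"})
s5 = Store(5, True, {"zone": "z3", "rack": "r1", "host": "h5"})

chosen = pick_replacement([s1, s2, s3, s4, s5], [s1, s2], location_labels)
print("replacement store:", chosen.id if chosen else None)
```

In this example, store 4 shares a zone with store 1, while store 5 sits in a zone that no remaining peer uses, so the sketch prefers store 5. The real checker takes more factors into account, such as store state and available space.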
+ +### Region merge + +Region merge refers to the process of merging adjacent small regions. It serves to avoid unnecessary resource consumption by a large number of small or even empty regions after data deletion. Region merge is performed by `mergeChecker`, which processes in a similar way to `replicaChecker`: PD continuously scans all regions in the background, and generates an operator when contiguous small regions are found. + +## Query scheduling status + +You can check the status of scheduling system through metrics, pd-ctl and logs. This section briefly introduces the methods of metrics and pd-ctl. Refer to [PD Monitoring Metrics](/v3.1/reference/key-monitoring-metrics/pd-dashboard.md) and [PD Control](/v3.1/reference/tools/pd-control.md) for details. + +### Operator status + +The **Grafana PD/Operator** page shows the metrics about operators, among which: + +- Schedule operator create: Operator creating information +- Operator finish duration: Execution time consumed by each operator +- Operator step duration: Execution time consumed by the operator step + +You can query operators using pd-ctl with the following commands: + +- `operator show`: Queries all operators generated in the current scheduling task +- `operator show [admin | leader | region]`: Queries operators by type + +### Balance status + +The **Grafana PD/Statistics - Balance** page shows the metrics about load balancing, among which: + +- Store leader/region score: Score of each store +- Store leader/region count: The number of leaders/regions in each store +- Store available: Available storage on each store + +You can use store commands of pd-ctl to query balance status of each store. 
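
As a rough illustration of how to read these numbers, the sketch below compares per-store leader scores and reports the relative gap between the highest and the lowest store. The input dictionary, the function name `score_spread`, and the 5% tolerance are all hypothetical and chosen only for this example. In practice the scores come from the Grafana panel or from the pd-ctl store commands, and PD applies its own internal tolerance.

```python
def score_spread(scores):
    """Return (lowest_store, highest_store, relative_gap) for a {store_id: score} map.

    relative_gap is (max - min) / max, or 0.0 when the highest score is zero.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    low = min(scores, key=scores.get)
    high = max(scores, key=scores.get)
    top = scores[high]
    gap = (top - scores[low]) / top if top > 0 else 0.0
    return low, high, gap


# Hypothetical leader scores for three stores (not real pd-ctl output).
leader_scores = {1: 1000.0, 4: 980.0, 5: 600.0}
low, high, gap = score_spread(leader_scores)
balanced = gap <= 0.05  # hypothetical tolerance, for illustration only
print(f"lowest=store {low}, highest=store {high}, gap={gap:.0%}, balanced={balanced}")
```

A large gap in scores suggests that balancing is still in progress or is blocked, whereas close scores combined with uneven real load point to the causes discussed in the scenarios later in this document.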
+ +### Hot Region status + +The **Grafana PD/Statistics - hotspot** page shows the metrics about hot regions, among which: + +- Hot write region’s leader/peer distribution: the leader/peer distribution in hot write regions +- Hot read region’s leader distribution: the leader distribution in hot read regions + +You can also query the status of hot regions using pd-ctl with the following commands: + +- `hot read`: Queries hot read regions +- `hot write`: Queries hot write regions +- `hot store`: Queries the distribution of hot regions by store +- `region topread [limit]`: Queries the region with top read traffic +- `region topwrite [limit]`: Queries the region with top write traffic + +### Region health + +The **Grafana PD/Cluster/Region health** panel shows the metrics about regions in abnormal states. + +You can query the list of regions in abnormal states using pd-ctl with region check commands: + +- `region check miss-peer`: Queries regions without enough peers +- `region check extra-peer`: Queries regions with extra peers +- `region check down-peer`: Queries regions with down peers +- `region check pending-peer`: Queries regions with pending peers + +## Control scheduling strategy + +You can use pd-ctl to adjust the scheduling strategy from the following three aspects. Refer to [PD Control](/v3.1/reference/tools/pd-control.md) for more details. + +### Add/delete scheduler manually + +PD supports dynamically adding and removing schedulers directly through pd-ctl. For example: + +- `scheduler show`: Shows currently running schedulers in the system +- `scheduler remove balance-leader-scheduler`: Removes (disable) balance-leader-scheduler +- `scheduler add evict-leader-scheduler-1`: Adds a scheduler to remove all leaders in Store 1 + +### Add/delete Operators manually + +PD also supports adding or removing operators directly through pd-ctl. 
For example: + +- `operator add add-peer 2 5`: Adds a peer of Region 2 on Store 5 +- `operator add transfer-leader 2 5`: Migrates the leader of Region 2 to Store 5 +- `operator add split-region 2`: Splits Region 2 into two regions evenly in size +- `operator remove 2`: Removes the currently pending operator on Region 2 + +### Adjust scheduling parameter + +You can check the scheduling configuration using the `config show` command in pd-ctl, and adjust the values using `config set {key} {value}`. Common adjustments include: + +- `leader-schedule-limit`: Controls the concurrency of leader transfer scheduling +- `region-schedule-limit`: Controls the concurrency of adding/deleting peer scheduling +- `disable-replace-offline-replica`: Determines whether to disable the scheduling to take nodes offline +- `disable-location-replacement`: Determines whether to disable the scheduling that handles the isolation level of regions +- `max-snapshot-count`: Controls the maximum concurrency of sending/receiving snapshots for each store + +## PD scheduling in common scenarios + +This section illustrates the best practices of PD scheduling strategies through several typical scenarios. + +### Leaders/regions are not evenly distributed + +Because of the rating mechanism of PD, the leader count and region count of different stores cannot fully reflect the load balancing status. Therefore, it is necessary to confirm whether there is a load imbalance from the actual load of TiKV or the storage usage. + +Once you have confirmed that leaders/regions are not evenly distributed, you need to check the rating of different stores. + +If the scores of different stores are close, it means PD mistakenly believes that leaders/regions are evenly distributed. Possible reasons are: + +- There are hot regions that cause a load imbalance. In this case, you need to analyze further based on [hot regions scheduling](#hot-regions-are-not-evenly-distributed).
+- There are a large number of empty regions or small regions, which leads to a great difference in the number of leaders in different stores and high pressure on Raft store. This is the time for a [region merge](#region-merge-is-slow) scheduling. +- Hardware and software environment varies among stores. You can adjust the values of `leader-weight` and `region-weight` accordingly to control the distribution of leader/region. +- Other unknown reasons. Still you can adjust the values of `leader-weight` and `region-weight` to control the distribution of leader/region. + +If there is a big difference in the rating of different stores, you need to examine the operator-related metrics, with special focus on the generation and execution of operators. There are two main situations: + +- When operators are generated normally but the scheduling process is slow, it is possible that: + + - The scheduling speed is limited by default for load balancing purpose. You can adjust `leader-schedule-limit` or `region-schedule-limit` to larger values without significantly impacting regular services. In addition, you can also properly ease the restrictions specified by `max-pending-peer-count` and `max-snapshot-count`. + - Other scheduling tasks are running concurrently, which slows down the balancing. In this case, if the balancing takes precedence over other scheduling tasks, you can stop other tasks or limit their speeds. For example, if you take some nodes offline when balancing is in progress, both operations consume the quota of `region-schedule-limit`. In this case, you can limit the speed of scheduler to remove nodes, or simply set `disable-replace-offline-replica = true` to temporarily disable it. + - The scheduling process is too slow. You can check the **Operator step duration** metric to confirm the cause. 
Generally, steps that do not involve sending and receiving snapshots (such as `TransferLeader`, `RemovePeer`, `PromoteLearner`) should be completed in milliseconds, while steps that involve snapshots (such as `AddLearner` and `AddPeer`) are expected to be completed in tens of seconds. If the duration is obviously too long, it could be caused by high pressure on TiKV or a network bottleneck, which needs case-by-case analysis. + +- PD fails to generate the corresponding balancing scheduler. Possible reasons include: + + - The scheduler is not activated. For example, the corresponding scheduler is deleted, or its limit is set to "0". + - Other constraints. For example, `evict-leader-scheduler` in the system prevents leaders from being migrated to the corresponding store. Or a label property is set, which makes some stores reject leaders. + - Restrictions from the cluster topology. For example, in a cluster of 3 replicas across 3 data centers, 3 replicas of each region are distributed in different data centers due to replica isolation. If the number of stores is different among these data centers, the scheduling can only reach a balanced state within each data center, but not balanced globally. + +### Taking nodes offline is slow + +This scenario requires examining the generation and execution of operators through related metrics. + +If operators are successfully generated but the scheduling process is slow, possible reasons are: + +- The scheduling speed is limited by default. You can adjust `leader-schedule-limit` or `replica-schedule-limit` to larger values. Similarly, you can consider loosening the limits on `max-pending-peer-count` and `max-snapshot-count`. +- Other scheduling tasks are running concurrently and racing for resources in the system. You can refer to the solution in [Leaders/regions are not evenly distributed](#leadersregions-are-not-evenly-distributed).
+- When you take a single node offline, a number of region leaders to be processed (around 1/3 under the configuration of 3 replicas) are distributed on the node to remove. Therefore, the speed is limited by the speed at which snapshots are generated by this single node. You can speed it up by manually adding an `evict-leader-scheduler` to migrate leaders. + +If the corresponding operator fails to generate, possible reasons are: + +- The operator is stopped, or `replica-schedule-limit` is set to "0". +- There is no proper node for region migration. For example, if the available capacity size of the replacing node (of the same label) is less than 20%, PD will stop scheduling to avoid running out of storage on that node. In such case, you need to add more nodes or delete some data to free the space. + +### Bringing nodes online is slow + +Currently, bringing nodes online is scheduled through the balance region mechanism. You can refer to [Leaders/regions are not evenly distributed](#leadersregions-are-not-evenly-distributed) for troubleshooting. + +### Hot regions are not evenly distributed + +Hot regions scheduling issues generally fall into the following categories: + +- Hot regions can be observed via PD metrics, but the scheduling speed cannot keep up to redistribute hot regions in time. + + **Solution**: adjust `hot-region-schedule-limit` to a larger value, and reduce the limit quota of other schedulers to speed up hot regions scheduling. Or you can adjust `hot-region-cache-hits-threshold` to a smaller value to make PD more sensitive to traffic changes. + +- Hotspot formed on a single region. For example, a small table is intensively scanned by a massive amount of requests. This can also be detected from PD metrics. Because you cannot actually distribute a single hotspot, you need to manually add a `split-region` operator to split such a region. 
+ +- TiKV-related metrics show that the load of some nodes is significantly higher than that of other nodes, which becomes the bottleneck of the whole system. Currently, PD counts hotspots through traffic analysis only, so it is possible that PD fails to identify hotspots in certain scenarios. For example, when there are intensive point lookup requests for some regions, the hotspot might not be obvious in the traffic statistics, but the high QPS might still lead to bottlenecks in key modules. + + **Solutions**: First, locate the table where hot regions are formed based on the specific business. Then add a `scatter-range-scheduler` to make all regions of this table evenly distributed. TiDB also provides an interface in its HTTP API to simplify this operation. Refer to [TiDB HTTP API](https://github.com/pingcap/tidb/blob/master/docs/tidb_http_api.md) for more details. + +### Region merge is slow + +Similar to slow scheduling, the speed of region merge is most likely limited by the configurations of `merge-schedule-limit` and `region-schedule-limit`, or the region merge scheduler is competing with other schedulers. Specifically, the solutions are: + +- If it is known from metrics that there are a large number of empty regions in the system, you can adjust `max-merge-region-size` and `max-merge-region-keys` to smaller values to speed up the merge. This is because the merge process involves replica migration, so the smaller the region to be merged, the faster the merge is. If the merge operators are already generated rapidly, to further speed up the process, you can set `patrol-region-interval` to `10ms`. This makes region scanning faster at the cost of more CPU consumption. + +- A lot of tables have been created and then emptied (including truncated tables). These empty regions cannot be merged if the split table attribute is enabled. 
You can disable this attribute by adjusting the following parameters: + + - TiKV: set `split-region-on-table` to `false` + - PD: set `namespace-classifier` to "" + +For v3.0.4 and v2.1.16 or earlier, the `approximate_keys` of regions are inaccurate in specific circumstances (most of which occur after dropping tables), which makes the number of keys break the constraints of `max-merge-region-keys`. To avoid this problem, you can adjust `max-merge-region-keys` to a larger value. + +### Troubleshoot TiKV node + +If a TiKV node fails, by default PD sets the corresponding node to the **down** state after 30 minutes (customizable by configuration item `max-store-down-time`), and then rebalances replicas for the regions involved. + +Practically, if a node failure is considered unrecoverable, you can immediately take it offline. This makes PD replenish replicas on another node sooner and reduces the risk of data loss. In contrast, if a node is considered recoverable, but the recovery cannot be done in 30 minutes, you can temporarily adjust `max-store-down-time` to a larger value to avoid unnecessary replenishment of the replicas and wasted resources after the timeout. 
\ No newline at end of file From f56a87853b5d80310c685c7fb64963df3cd44bc6 Mon Sep 17 00:00:00 2001 From: anotherrachel Date: Mon, 9 Dec 2019 14:38:32 +0800 Subject: [PATCH 8/8] update toc --- dev/TOC.md | 1 + dev/reference/best-practices/pd-scheduling.md | 4 ++-- v2.1/TOC.md | 1 + v2.1/reference/best-practices/pd-scheduling.md | 4 ++-- v3.0/TOC.md | 1 + v3.0/reference/best-practices/pd-scheduling.md | 4 ++-- v3.1/TOC.md | 1 + v3.1/reference/best-practices/pd-scheduling.md | 4 ++-- 8 files changed, 12 insertions(+), 8 deletions(-) diff --git a/dev/TOC.md b/dev/TOC.md index f1c426930a071..28fc818457bbd 100644 --- a/dev/TOC.md +++ b/dev/TOC.md @@ -289,6 +289,7 @@ + Best Practices - [Highly Concurrent Write Best Practices](/dev/reference/best-practices/high-concurrency.md) - [HAProxy Best Practices](/dev/reference/best-practices/haproxy.md) + - [PD Scheduling Best Practices](/dev/reference/best-practices/pd-scheduling.md) - [TiSpark](/dev/reference/tispark.md) + TiDB Binlog - [Overview](/dev/reference/tidb-binlog/overview.md) diff --git a/dev/reference/best-practices/pd-scheduling.md b/dev/reference/best-practices/pd-scheduling.md index 1089203f10227..438a4e3f95bb6 100644 --- a/dev/reference/best-practices/pd-scheduling.md +++ b/dev/reference/best-practices/pd-scheduling.md @@ -1,10 +1,10 @@ --- -title: PD Scheduling +title: PD Scheduling Best Practices summary: Learn best practice and strategy for PD scheduling. category: reference --- -# PD Scheduling +# PD Scheduling Best Practices This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. 
This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: diff --git a/v2.1/TOC.md b/v2.1/TOC.md index 12b3cc6a1a8ff..a916762aa00d6 100644 --- a/v2.1/TOC.md +++ b/v2.1/TOC.md @@ -260,6 +260,7 @@ - [Alert Rules](/v2.1/reference/alert-rules.md) + Best Practices - [HAProxy Best Practices](/v2.1/reference/best-practices/haproxy.md) + - [PD Scheduling Best Practices](/v2.1/reference/best-practices/pd-scheduling.md) - [TiSpark](/v2.1/reference/tispark.md) + TiDB Binlog - [Overview](/v2.1/reference/tidb-binlog/overview.md) diff --git a/v2.1/reference/best-practices/pd-scheduling.md b/v2.1/reference/best-practices/pd-scheduling.md index e93ea90b45bb5..b78e563ee60c0 100644 --- a/v2.1/reference/best-practices/pd-scheduling.md +++ b/v2.1/reference/best-practices/pd-scheduling.md @@ -1,10 +1,10 @@ --- -title: PD Scheduling +title: PD Scheduling Best Practices summary: Learn best practice and strategy for PD scheduling. category: reference --- -# PD Scheduling +# PD Scheduling Best Practices This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. 
This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: diff --git a/v3.0/TOC.md b/v3.0/TOC.md index 28607afdfee65..b509d95a9fcd7 100644 --- a/v3.0/TOC.md +++ b/v3.0/TOC.md @@ -285,6 +285,7 @@ + Best Practices - [Highly Concurrent Write Best Practices](/v3.0/reference/best-practices/high-concurrency.md) - [HAProxy Best Practices](/v3.0/reference/best-practices/haproxy.md) + - [PD Scheduling Best Practices](/v3.0/reference/best-practices/pd-scheduling.md) - [TiSpark](/v3.0/reference/tispark.md) + TiDB Binlog - [Overview](/v3.0/reference/tidb-binlog/overview.md) diff --git a/v3.0/reference/best-practices/pd-scheduling.md b/v3.0/reference/best-practices/pd-scheduling.md index d823eb3b25ae2..efd29b8f0c172 100644 --- a/v3.0/reference/best-practices/pd-scheduling.md +++ b/v3.0/reference/best-practices/pd-scheduling.md @@ -1,10 +1,10 @@ --- -title: PD Scheduling +title: PD Scheduling Best Practices summary: Learn best practice and strategy for PD scheduling. category: reference --- -# PD Scheduling +# PD Scheduling Best Practices This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. 
This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: diff --git a/v3.1/TOC.md b/v3.1/TOC.md index 08de2f1f1c717..b6a4400bbde1f 100644 --- a/v3.1/TOC.md +++ b/v3.1/TOC.md @@ -288,6 +288,7 @@ + Best Practices - [Highly Concurrent Write Best Practices](/v3.1/reference/best-practices/high-concurrency.md) - [HAproxy Best Practices](/v3.1/reference/best-practices/haproxy.md) + - [PD Scheduling Best Practices](/v3.1/reference/best-practices/pd-scheduling.md) - [TiSpark](/v3.1/reference/tispark.md) + TiDB Binlog - [Overview](/v3.1/reference/tidb-binlog/overview.md) diff --git a/v3.1/reference/best-practices/pd-scheduling.md b/v3.1/reference/best-practices/pd-scheduling.md index e8beb64e88943..75663a5480afc 100644 --- a/v3.1/reference/best-practices/pd-scheduling.md +++ b/v3.1/reference/best-practices/pd-scheduling.md @@ -1,10 +1,10 @@ --- -title: PD Scheduling +title: PD Scheduling Best Practices summary: Learn best practice and strategy for PD scheduling. category: reference --- -# PD Scheduling +# PD Scheduling Best Practices This document details the principles and strategies of PD scheduling through common scenarios to facilitate your application. This document assumes that you have a basic understanding of TiDB, TiKV and PD with the following core concepts: