diff --git a/media/schedule-limit-new.png b/media/schedule-limit-new.png
new file mode 100644
index 00000000..cc8d43ef
Binary files /dev/null and b/media/schedule-limit-new.png differ
diff --git a/media/schedule-limit-old.png b/media/schedule-limit-old.png
new file mode 100644
index 00000000..c1ac00e5
Binary files /dev/null and b/media/schedule-limit-old.png differ
diff --git a/text/2018-09-17-schedule-limit.md b/text/2018-09-17-schedule-limit.md
new file mode 100644
index 00000000..1644a1d7
--- /dev/null
+++ b/text/2018-09-17-schedule-limit.md
@@ -0,0 +1,176 @@
+# Schedule Limit
+
+## Summary
+
+This RFC proposes a new scheme to limit concurrently executing schedule
+operators. Compared to the old implementation, it is more intuitive, easier
+to configure, and has better adaptability.
+
+## Motivation
+
+To prevent PD from generating a large number of schedule operators in a
+short period of time and hurting the performance of the TiKV cluster, and to
+avoid the problem of over-balance, PD currently uses a limitation-based
+scheme.
+
+Specifically, PD uses 4 different limit configurations to control the number
+of concurrently running operators:
+
+* _leader-schedule-limit_
+
+  Controls the number of operators scheduling Region leaders at the same
+  time.
+
+* _region-schedule-limit_
+
+  Controls the number of operators scheduling Region peers at the same time.
+
+* _replica-schedule-limit_
+
+  Controls the number of operators scheduling Region replicas at the same
+  time.
+
+* _merge-schedule-limit_
+
+  Controls the number of Region merge operators running at the same time.
+
+![Old schedule limit scheme](../media/schedule-limit-old.png)
+
+The mechanism of the scheduling limiter is shown in the figure above. All
+running operators are maintained as a map in the coordinator, and the
+Limiter is updated synchronously whenever the running operators change. Each
+scheduler checks whether the running operators have exceeded the limit
+before generating new operators. The operators generated by schedulers are
+added to the _runningOperators_ map via the `addOperator` method.
+
+This scheme falls short in several aspects:
+
+* The various limit configurations are confusing and tricky to use correctly.
+* A fixed limit configuration does not suit clusters of different sizes.
+* The preemption strategy between different schedulers is not fair enough.
+* Scheduling is often concentrated on a small number of `tikv-server`s,
+  which sometimes leads to slow scheduling and may affect the latency of
+  TiKV (we do consider the snapshot count, but this statistic is reported
+  only after TiKV receives the scheduling request).
+* It relies on every scheduler correctly checking the limit to ensure the
+  limit configuration is not violated, which is a hotbed of bugs. See
+  [#1193](https://github.com/pingcap/pd/pull/1193) and
+  [#1155](https://github.com/pingcap/pd/pull/1155).
+
+## Detailed design
+
+### Introducing waitingOperators queue
+
+![New schedule limit scheme](../media/schedule-limit-new.png)
+
+As shown in the figure, the biggest change is the introduction of the
+_waitingOperators_ queue in the coordinator. The schedule operators
+generated by schedulers are no longer put into _runningOperators_ directly,
+but are queued first in _waitingOperators_. The coordinator is responsible
+for continuously promoting operators from the _waitingOperators_ queue to
+_runningOperators_ according to certain rules and configurations.
+
+### Operator effect evaluation
+
+Schedule operators consist of operator steps. The effect of an operator on
+the cluster can be calculated as the sum of the effects of all its steps.
+There are several types of operator steps:
+
+* _TransferLeader_
+* _AddPeer_
+* _AddLearner_
+* _PromoteLearner_
+* _RemovePeer_
+* _MergeRegion_
+* _SplitRegion_
+
+These steps only affect a subset of all `tikv-server`s, not the entire
+cluster. For example, _TransferLeader_ only affects the original and the new
+leader of the Region, while _AddLearner_ affects the store that adds the
+learner and the leader of the Region.
+
+The overhead of different types of operator steps is different, so we can
+assign different cost values to them based on experience. For example, we
+can arbitrarily set the cost of _TransferLeader_ to 1, the cost of
+_RemovePeer_ to 2, and the cost of _AddLearner_ to 5 (leader) and 8 (new
+peer).
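+
+To make this concrete, below is a minimal sketch in Go of how the per-store
+cost of an operator could be accumulated from its steps. The type names
+(`StoreID`, `OpStep`, `Operator`) and the cost constants are illustrative
+assumptions for this RFC, not the actual PD implementation.
+
+```go
+package schedule
+
+// StoreID identifies a tikv-server. Illustrative only.
+type StoreID uint64
+
+// OpStep is a single operator step; it reports the schedule cost it adds
+// to each store it touches.
+type OpStep interface {
+	Cost() map[StoreID]int
+}
+
+// TransferLeader affects only the original and the new leader store.
+// Assigning cost 1 to each affected store is an assumption of this sketch.
+type TransferLeader struct{ From, To StoreID }
+
+func (s TransferLeader) Cost() map[StoreID]int {
+	return map[StoreID]int{s.From: 1, s.To: 1}
+}
+
+// AddLearner affects the Region leader (which generates and sends the
+// snapshot) and the store that receives the new peer, with different costs.
+type AddLearner struct{ Leader, Target StoreID }
+
+func (s AddLearner) Cost() map[StoreID]int {
+	return map[StoreID]int{s.Leader: 5, s.Target: 8}
+}
+
+// RemovePeer affects only the store that loses the peer.
+type RemovePeer struct{ Store StoreID }
+
+func (s RemovePeer) Cost() map[StoreID]int {
+	return map[StoreID]int{s.Store: 2}
+}
+
+// Operator is a sequence of steps. Its effect on the cluster is the
+// per-store sum of the costs of all its steps.
+type Operator struct {
+	Scheduler string // name of the scheduler that generated it
+	Steps     []OpStep
+}
+
+// Cost returns the total schedule cost this operator places on each store.
+func (op *Operator) Cost() map[StoreID]int {
+	total := make(map[StoreID]int)
+	for _, step := range op.Steps {
+		for store, c := range step.Cost() {
+			total[store] += c
+		}
+	}
+	return total
+}
+```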
+
+### How to promote waiting operators
+
+Given a _MaxScheduleCost_ configuration (say 20), when the coordinator
+promotes an operator, it needs to guarantee that, after adding up the
+schedule cost of everything in _runningOperators_, the total cost on any
+store does not exceed the configured value.
+
+If the first operator in the waiting queue would overload a store, it waits
+until the conflicting operators finish or time out before it can be moved to
+_runningOperators_.
+
+To improve execution efficiency, a non-first operator that does not conflict
+with any operator ahead of it can also be moved to _runningOperators_.
+
+### How to ensure fair competition between different schedulers
+
+This is actually quite straightforward. We only need to limit the number of
+operators each scheduler may keep in the _waitingOperators_ queue, for
+example, at most 3 operators from the same scheduler.
+
+Only when one of its operators is moved into _runningOperators_ does the
+corresponding scheduler get the opportunity to insert more operators into
+_waitingOperators_. This strategy not only ensures fairness between
+different schedulers, but also encourages schedulers to generate
+non-conflicting operators, thus reducing the impact on cluster performance.
+
+As before, if different schedulers generate operators for the same Region,
+the later one is rejected directly. Dropping the later operator when it is
+about to be promoted is also an option.
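+
+The promotion rule and the per-scheduler cap can be sketched as follows,
+reusing the illustrative `Operator` and `StoreID` types from the sketch
+above. This is a simplified model of the coordinator's check, not the actual
+implementation; the names (`canPromote`, `promote`, `canEnqueue`,
+`maxScheduleCost`, `maxWaitingOperator`) are assumptions.
+
+```go
+// canPromote reports whether op can run without pushing any store's total
+// schedule cost above maxScheduleCost, given the already running operators.
+func canPromote(op *Operator, running []*Operator, maxScheduleCost int) bool {
+	used := make(map[StoreID]int)
+	for _, r := range running {
+		for store, c := range r.Cost() {
+			used[store] += c
+		}
+	}
+	for store, c := range op.Cost() {
+		if used[store]+c > maxScheduleCost {
+			return false
+		}
+	}
+	return true
+}
+
+// promote walks the waiting queue in order. An operator is promoted if it
+// fits the per-store budget and does not touch a store claimed by a blocked
+// operator ahead of it, so later operators cannot starve earlier ones.
+func promote(waiting, running []*Operator, maxScheduleCost int) (newWaiting, newRunning []*Operator) {
+	newRunning = running
+	blocked := make(map[StoreID]bool) // stores claimed by operators kept waiting
+	for _, op := range waiting {
+		conflict := false
+		for store := range op.Cost() {
+			if blocked[store] {
+				conflict = true
+				break
+			}
+		}
+		if !conflict && canPromote(op, newRunning, maxScheduleCost) {
+			newRunning = append(newRunning, op)
+			continue
+		}
+		newWaiting = append(newWaiting, op)
+		for store := range op.Cost() {
+			blocked[store] = true
+		}
+	}
+	return newWaiting, newRunning
+}
+
+// canEnqueue enforces fairness at enqueue time: a scheduler may only add a
+// new operator if it has fewer than maxWaitingOperator operators waiting.
+func canEnqueue(waiting []*Operator, scheduler string, maxWaitingOperator int) bool {
+	count := 0
+	for _, op := range waiting {
+		if op.Scheduler == scheduler {
+			count++
+		}
+	}
+	return count < maxWaitingOperator
+}
+```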
+
+### Configuration and customization
+
+There is only 1 important configuration for basic usage:
+
+* _MaxScheduleCost_
+
+  The maximum total cost of running operators allowed on each store.
+
+We can also allow the following configurations to be customized to better
+fit special needs (tests are needed to decide whether each of them is worth
+making configurable):
+
+* _StoreMaxScheduleCost_
+
+  Similar to _MaxScheduleCost_, but for a specific store. It could be useful
+  when the performance of `tikv-server`s differs.
+
+* _OperatorStepCost_
+
+  We will assign cost values based on experience. They may need adjustment
+  in production.
+
+* _MaxWaitingOperator_
+
+  The maximum number of waiting operators for each scheduler.
+
+* _SchedulerMaxWaitingOperator_
+
+  Similar to _MaxWaitingOperator_, but for a specific scheduler. We can use
+  it to control the priority of different schedulers.
+
+## Drawbacks
+
+Migrating from the old scheme to the new one has a certain cost, involving
+updates to the deployment scripts and future compatibility with old cluster
+configurations.
+
+## Alternatives
+
+We have considered letting the schedulers use a cooperative rather than a
+competitive scheme to generate operators. However, this approach depends
+more heavily on the implementation quality of all the schedulers, and the
+schedulers need to be aware of each other, which increases the coupling of
+the system.
+
+## Unresolved questions
+
+The arbitrarily determined step costs may not correctly reflect the actual
+scheduling overhead, so we need tests to verify that they are appropriate.