Skip to content

server/schedulers: introduce evict-slow-store-scheduler #3922

Merged
ti-chi-bot merged 27 commits into
tikv:masterfrom
5kbpers:slow-node
Aug 6, 2021
Merged

server/schedulers: introduce evict-slow-store-scheduler #3922
ti-chi-bot merged 27 commits into
tikv:masterfrom
5kbpers:slow-node

Conversation

@5kbpers
Copy link
Copy Markdown
Member

@5kbpers 5kbpers commented Jul 29, 2021

What problem does this PR solve?

For addressing tikv/tikv#10539, we will introduce a slow score concept into tikv and report it to pd in store heartbeats. We then implement evict-slow-store-scheduler to detect and evict all leaders from one store that was considered slow.

Blocked by pingcap/kvproto#789

What is changed and how it works?

  • implement evict-slow-store-scheduler
  • show slow score in store info

Check List

Tests

  • Unit test
  • Integration test

Code changes

  • Has persistent data change

Related changes

  • Need to cherry-pick to the release branch

Release note

introduce evict-slow-store-scheduler for detecting and evicting slow stores

5kbpers added 7 commits July 22, 2021 14:58
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
This reverts commit e7f8021.

Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
@ti-chi-bot
Copy link
Copy Markdown
Member

ti-chi-bot commented Jul 29, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • disksing
  • rleungx

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Details

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jul 29, 2021
@ti-chi-bot ti-chi-bot requested review from disksing and rleungx July 29, 2021 08:51
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 29, 2021

Codecov Report

Merging #3922 (6451ef4) into master (b7ab40c) will decrease coverage by 0.06%.
The diff coverage is 61.67%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3922      +/-   ##
==========================================
- Coverage   74.65%   74.58%   -0.07%     
==========================================
  Files         246      247       +1     
  Lines       25135    25276     +141     
==========================================
+ Hits        18764    18852      +88     
- Misses       4714     4748      +34     
- Partials     1657     1676      +19     
Flag Coverage Δ
unittests 74.58% <61.67%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/api/scheduler.go 48.58% <0.00%> (-1.13%) ⬇️
server/cluster/cluster.go 82.33% <0.00%> (-0.20%) ⬇️
server/handler.go 43.38% <0.00%> (-0.11%) ⬇️
server/statistics/store_collection.go 93.66% <33.33%> (-2.03%) ⬇️
server/schedulers/evict_slow_store.go 52.22% <52.22%> (ø)
server/core/store.go 80.00% <63.63%> (-1.52%) ⬇️
server/schedulers/evict_leader.go 74.50% <84.61%> (+0.12%) ⬆️
server/api/store.go 65.24% <100.00%> (+0.10%) ⬆️
server/core/basic_cluster.go 91.27% <100.00%> (+0.31%) ⬆️
server/core/store_option.go 100.00% <100.00%> (ø)
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0c9916...6451ef4. Read the comment docs.

Signed-off-by: 5kbpers <tangminghua@pingcap.com>
@disksing
Copy link
Copy Markdown
Contributor

Hi, I have some concerns about it:

  • Can it really solve the problems of poor tikv performance, considering that even if the leader is removed, follower writes may still bring other problems when the speed is too slow.
  • What should be the behavior if all tikv scores are low?
  • The condition whether it is slow or not should be added to the stateFilter, otherwise other schedulers will conflict with evict-slow-scheduler.

@5kbpers
Copy link
Copy Markdown
Member Author

5kbpers commented Aug 2, 2021

Hi, I have some concerns about it:

  • Can it really solve the problems of poor tikv performance, considering that even if the leader is removed, follower writes may still bring other problems when the speed is too slow.
  • What should be the behavior if all tikv scores are low?
  • The condition whether it is slow or not should be added to the stateFilter, otherwise other schedulers will conflict with evict-slow-scheduler.
  • Currently we can only handle the situation that only one tikv gets slow down. Then the majority of replicas could have the ability to continue to serve requests without serious regression in most cases
  • If most of TiKVs get slow down, the only thing we can do is alerting the user and handle it manually

5kbpers added 5 commits August 3, 2021 14:41
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Comment thread go.mod Outdated
Comment thread server/cluster/cluster.go Outdated
Comment thread server/cluster/cluster.go Outdated
Comment thread server/schedulers/evict_slow_store.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we can evict leader from only 1 slow store, why use a vector to store the ID>

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there would be a chance to introduce a more complex strategy for this scheduler. So use a vector here for keeping flexibility.

Comment thread server/schedulers/evict_slow_store.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not a very good idea to use another scheduler as a toolkit because each scheduler has its own context. For example, scheduler names will be recorded to the filter metrics, it may happen that we want to record evict-slow-store-scheduler and end up with evict-leader-scheduler.
There are two possible ways to improve this: a) reimplement the similar code again b) extract the common parts and let different schedulers call them

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is basically the same so I think it would be better to reuse it directly. About the metrics, I saw that it is based on the name of the scheduler, may setting a customized name is a good solution?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider extracting the common part?

Signed-off-by: 5kbpers <tangminghua@pingcap.com>
5kbpers added 4 commits August 5, 2021 14:54
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
@lhy1024 lhy1024 added the priority/P1 The issue has P1 priority. label Aug 5, 2021
@lhy1024
Copy link
Copy Markdown
Contributor

lhy1024 commented Aug 5, 2021

This pr will release in sprint4, PTAL

// NewEvictSlowStoreSchedulerCommand returns a command to add a evict-slow-store-scheduler.
func NewEvictSlowStoreSchedulerCommand() *cobra.Command {
c := &cobra.Command{
Use: "evict-slow-store-scheduler",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the schedule should add store_id param???

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it just needs the id of the previous evicted store which was stored in etcd.

PauseLeaderTransfer(id uint64) error
ResumeLeaderTransfer(id uint64)

SlowStoreEvicted(id uint64) error
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse PauseLeaderTransfer/ResumeLeaderTransfer?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will make this scheduler conflicts with evict-leader-scheduler

5kbpers added 2 commits August 6, 2021 14:08
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
Comment thread server/schedulers/evict_slow_store.go Outdated
Comment thread server/schedulers/evict_slow_store.go Outdated
5kbpers added 4 commits August 6, 2021 18:08
Comment thread server/schedule/filter/filters.go Outdated
Comment thread server/schedulers/evict_leader.go Outdated
Comment thread server/schedulers/evict_leader.go Outdated
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Aug 6, 2021
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Aug 6, 2021
@5kbpers
Copy link
Copy Markdown
Member Author

5kbpers commented Aug 6, 2021

/merge

@ti-chi-bot
Copy link
Copy Markdown
Member

@5kbpers: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Copy Markdown
Member

@5kbpers: /merge is only allowed for the committers, you can assign this pull request to the committer in list by filling /assign @committer in the comment to help merge this pull request.

Details

In response to this:

/merge

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@disksing
Copy link
Copy Markdown
Contributor

disksing commented Aug 6, 2021

/merge

@ti-chi-bot
Copy link
Copy Markdown
Member

@disksing: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Copy Markdown
Member

This pull request has been accepted and is ready to merge.

DetailsCommit hash: 6451ef4

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Aug 6, 2021
@ti-chi-bot ti-chi-bot merged commit 21fef03 into tikv:master Aug 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/scheduler Scheduler logic. priority/P1 The issue has P1 priority. release-note Denotes a PR that will be considered when it comes time to generate release notes. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants