Don't re-patch pods that are already controlled by current worker #26778
Merged
ephraimbuddy merged 1 commit into apache:main on Oct 18, 2022
Conversation
After the scheduler has launched many pods, it keeps trying to re-adopt them by patching every pod. Each patch operation involves a remote API call, which can be very slow. In the meantime the scheduler cannot do anything else.

By ignoring the pods that already carry the expected label, the list query result becomes shorter and far fewer patch queries are issued.

We hit an unlucky moment in our environment where each patch operation started taking 100 ms; with 200 pods in flight, that accumulates into 20 seconds of blocked scheduler.
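The idea can be sketched with the official Kubernetes Python client. This is a minimal illustration of the filtering approach, not the actual Airflow KubernetesExecutor code; the namespace, the `airflow-worker` label, and the `scheduler_job_id` value are assumptions chosen for the example:

```python
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

scheduler_job_id = "42"  # hypothetical id of the current worker/scheduler
namespace = "airflow"    # hypothetical namespace

# A `!=` label selector excludes pods that already carry the expected label,
# so they are neither returned by the list call nor re-patched below.
pods = v1.list_namespaced_pod(
    namespace=namespace,
    label_selector=f"airflow-worker!={scheduler_job_id}",
)

for pod in pods.items:
    # One remote API call per remaining pod; the selector above keeps this
    # loop from touching pods the current worker already controls.
    v1.patch_namespaced_pod(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body={"metadata": {"labels": {"airflow-worker": scheduler_job_id}}},
    )
```

With the selector in place, a steady state where all 200 pods are already labeled costs a single list call instead of 200 patch calls, which removes the 20-second stall described above.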
uranusjr approved these changes on Oct 6, 2022
ephraimbuddy pushed a commit that referenced this pull request on Oct 18, 2022
Don't re-patch pods that are already controlled by current worker (#26778)

After the scheduler has launched many pods, it keeps trying to re-adopt them by patching every pod. Each patch operation involves a remote API call, which can be very slow. In the meantime the scheduler cannot do anything else. By ignoring the pods that already carry the expected label, the list query result becomes shorter and far fewer patch queries are issued. We hit an unlucky moment in our environment where each patch operation started taking 100 ms; with 200 pods in flight, that accumulates into 20 seconds of blocked scheduler.

(cherry picked from commit 27ec562)
ephraimbuddy pushed a commit that referenced this pull request on Oct 18, 2022
Don't re-patch pods that are already controlled by current worker (#26778)

After the scheduler has launched many pods, it keeps trying to re-adopt them by patching every pod. Each patch operation involves a remote API call, which can be very slow. In the meantime the scheduler cannot do anything else. By ignoring the pods that already carry the expected label, the list query result becomes shorter and far fewer patch queries are issued. We hit an unlucky moment in our environment where each patch operation started taking 100 ms; with 200 pods in flight, that accumulates into 20 seconds of blocked scheduler.

(cherry picked from commit 27ec562)