Don't re-patch pods that are already controlled by current worker #26778

Merged
ephraimbuddy merged 1 commit into apache:main from hterik:dontrepatchpods
Oct 18, 2022

@hterik hterik commented Sep 29, 2022

After the scheduler has launched many pods, it keeps trying to re-adopt them by patching every pod. Each patch operation involves a remote API call, which can be very slow. In the meantime, the scheduler cannot do anything else.

By ignoring the pods that already have the expected label, the list query result will be shorter and the number of patch queries much smaller.

We had an unlucky moment in our environment where each patch operation started taking 100 ms. With 200 pods in flight, that accumulates to 20 seconds of blocked scheduler time.
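
To make the idea concrete, here is a minimal in-memory sketch of the filtering, not the actual change in this PR. The `Pod` class, the helper names, the `airflow-worker` label key, and the scheduler ids are all assumptions for illustration. The list comprehension stands in for a Kubernetes label selector of the form `key!=value`, which also matches pods that lack the key entirely.

```python
from dataclasses import dataclass, field

WORKER_LABEL = "airflow-worker"  # assumed label key, for illustration only
PATCH_LATENCY_S = 0.1            # the 100 ms per patch reported above


@dataclass
class Pod:
    """Hypothetical stand-in for a pod object returned by the API server."""
    name: str
    labels: dict = field(default_factory=dict)


def pods_needing_adoption(pods, worker_id):
    """Return pods whose worker label is missing or differs from worker_id.

    Emulates the server-side selector f"{WORKER_LABEL}!={worker_id}".
    """
    return [p for p in pods if p.labels.get(WORKER_LABEL) != worker_id]


def adopt(pods, worker_id):
    """Relabel each pod and return how many patches were issued."""
    patches = 0
    for pod in pods:
        pod.labels[WORKER_LABEL] = worker_id  # stands in for one remote patch call
        patches += 1
    return patches


# 195 pods already owned by the current scheduler, 5 left over from another one.
pods = [Pod(f"pod-{i}", {WORKER_LABEL: "scheduler-1"}) for i in range(195)]
pods += [Pod(f"orphan-{i}", {WORKER_LABEL: "scheduler-0"}) for i in range(5)]

before = len(pods) * PATCH_LATENCY_S           # cost of patching every pod
to_patch = pods_needing_adoption(pods, "scheduler-1")
patches = adopt(to_patch, "scheduler-1")
after = patches * PATCH_LATENCY_S              # cost of patching only the rest

print(f"unfiltered: {len(pods)} patches, ~{before:.1f}s blocked")
print(f"filtered:   {patches} patches, ~{after:.1f}s blocked")
```

With the filter in place, a second adoption pass finds nothing left to patch, so the blocked time depends on the number of pods that are not yet adopted rather than on the total number in flight.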

@boring-cyborg boring-cyborg bot added the provider:cncf-kubernetes and area:Scheduler labels Sep 29, 2022
@eladkal eladkal added this to the Airflow 2.4.2 milestone Oct 16, 2022
@ephraimbuddy ephraimbuddy merged commit 27ec562 into apache:main Oct 18, 2022
ephraimbuddy pushed a commit that referenced this pull request Oct 18, 2022
…6778)

(cherry picked from commit 27ec562)
@ephraimbuddy ephraimbuddy added the type:bug-fix label Oct 18, 2022
@droppoint droppoint mentioned this pull request Nov 21, 2023