Be more selective when adopting pods with KubernetesExecutor #28899
Merged
ephraimbuddy merged 4 commits into apache:main on Jan 18, 2023
When trying to adopt "resettable" TIs from SchedulerJob, we should not
list all the pods to compare against, only those that didn't
succeed. This means we will get any pods that are still starting,
running, or failed (meaning the TI wasn't moved to a terminal state
there, and will be in our "adoptable" list).
This avoids the scenario where a dead scheduler has both a completed,
successful worker, and a still running worker, causing log lines
like these about the successful one:
ERROR - attempting to adopt taskinstance which was not specified by
database: TaskInstanceKey(...)
This also makes sure we only find pods with the
`kubernetes_executor=True` label for extra safety.
Closes apache#28071
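To make the selection concrete, here is a minimal sketch that imitates in memory what the two selectors ask the Kubernetes API to do. The `Pod` dataclass, the `adoptable_candidates` helper, and the sample pod names are hypothetical and exist only for illustration; the executor itself passes the selectors to the API server rather than filtering locally.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical stand-in for a Kubernetes pod (illustration only)."""

    name: str
    phase: str  # e.g. "Pending", "Running", "Succeeded", "Failed"
    labels: dict = field(default_factory=dict)


def adoptable_candidates(pods, scheduler_job_id):
    """Imitate field_selector "status.phase!=Succeeded" combined with
    label_selector "kubernetes_executor=True,airflow-worker=<id>"."""
    return [
        pod
        for pod in pods
        if pod.phase != "Succeeded"
        and pod.labels.get("kubernetes_executor") == "True"
        and pod.labels.get("airflow-worker") == str(scheduler_job_id)
    ]


worker_labels = {"kubernetes_executor": "True", "airflow-worker": "7"}
pods = [
    Pod("done", "Succeeded", dict(worker_labels)),   # skipped: already succeeded
    Pod("busy", "Running", dict(worker_labels)),     # kept: still running
    Pod("broken", "Failed", dict(worker_labels)),    # kept: TI may need a reset
    Pod("other", "Running", {"airflow-worker": "7"}),  # skipped: no executor label
]

selected = [pod.name for pod in adoptable_candidates(pods, 7)]
print(selected)
```

The succeeded pod never reaches the comparison step, so it cannot trigger the "not specified by database" error quoted above.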
dstandish reviewed on Jan 12, 2023
    for scheduler_job_id in scheduler_job_ids:
        scheduler_job_id = pod_generator.make_safe_label_value(str(scheduler_job_id))
        query_kwargs = {"label_selector": f"airflow-worker={scheduler_job_id}"}
        # We will look for any pods owned by the no-longer-running scheduler,
Contributor
do we know that scheduler_job_ids are all not running?
Member
Author
Yes, if you look up a few lines, we build that from the TIs that SchedulerJob asks us to try to adopt, and those TIs are tied to non-running SchedulerJobs.
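As a rough sketch of that derivation: the job ids come from the task instances handed over for adoption, not from a list of all schedulers. `FakeTI` and `job_ids_to_search` are hypothetical names, and the use of a `queued_by_job_id` attribute is an assumption about how a TI records the scheduler that queued it; the lines referred to above are not shown in this hunk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTI:
    """Hypothetical, minimal stand-in for a TaskInstance."""

    task_id: str
    queued_by_job_id: Optional[int]  # assumed attribute name


def job_ids_to_search(tis):
    """Collect the distinct scheduler job ids that queued the given TIs."""
    return {ti.queued_by_job_id for ti in tis if ti.queued_by_job_id is not None}


# TIs that the scheduler asked the executor to try to adopt.
tis = [FakeTI("extract", 7), FakeTI("load", 7), FakeTI("report", 9)]
scheduler_job_ids = job_ids_to_search(tis)
print(sorted(scheduler_job_ids))
```

Because the set is built only from adoptable TIs, every id in it belongs to a scheduler that is no longer running.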
dstandish reviewed on Jan 12, 2023
    # still be in queued.
    query_kwargs = {
        "field_selector": "status.phase!=Succeeded",
        "label_selector": f"kubernetes_executor=True,airflow-worker={scheduler_job_id}",
Contributor
i'll bet you there's a selector IN but... don't care :)
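For reference, Kubernetes label selectors do support a set-based `in` operator, and it can be mixed with equality requirements in one selector string. Field selectors accept only `=`, `==`, and `!=`. The sketch below shows how several scheduler job ids could be folded into a single label selector; `worker_label_selector` is a hypothetical helper, not something this PR adds, and the merged change keeps one query per scheduler job id.

```python
def worker_label_selector(scheduler_job_ids):
    """Build one label selector that matches workers of any listed job id.

    Hypothetical helper for illustration. Real values would first need to be
    made label-safe, as the diff does with make_safe_label_value.
    """
    values = sorted(str(job_id) for job_id in scheduler_job_ids)
    if not values:
        # An empty "in ()" set is not a valid selector.
        raise ValueError("at least one scheduler job id is required")
    return f"kubernetes_executor=True,airflow-worker in ({','.join(values)})"


selector = worker_label_selector({12, 7})
print(selector)
```

A single query like this trades several list calls for one, at the cost of a longer selector string when many schedulers have died.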
ephraimbuddy approved these changes on Jan 18, 2023
pierrejeambrun pushed a commit that referenced this pull request on Mar 6, 2023
* Be more selective when adopting pods with KubernetesExecutor
When trying to adopt "resettable" TIs from SchedulerJob, we should not
list out all the pods to compare against, only those that didn't
succeed. This means we will get any pods that are still starting,
running, or failed (meaning the TI wasn't moved to a terminal state
there, and will be in our "adoptable" list).
This avoids the scenario where a dead scheduler has both a completed,
successful worker, and a still running worker, causing log lines
like these about the successful one:
ERROR - attempting to adopt taskinstance which was not specified by
database: TaskInstanceKey(...)
This also makes sure we only find pods with the
`kubernetes_executor=True` label for extra safety.
Closes #28071
* Also ignore done pods
(cherry picked from commit f64ac59)
pierrejeambrun pushed a commit that referenced this pull request on Mar 7, 2023
pierrejeambrun pushed a commit that referenced this pull request on Mar 8, 2023