Be more selective when adopting pods with KubernetesExecutor by jedcunningham · Pull Request #28899 · apache/airflow

jedcunningham · 2023-01-12T20:15:22Z

When trying to adopt "resettable" TIs from SchedulerJob, we should not list out all the pods to compare against, only those that didn't succeed. This means we will get any pods that are still starting, running, or failed (meaning the TI wasn't moved to a terminal state there, and will be in our "adoptable" list).

This avoids the scenario where a dead scheduler has both a completed, successful worker, and a still running worker, causing log lines like these about the successful one:

ERROR - attempting to adopt taskinstance which was not specified by database: TaskInstanceKey(...)

This also makes sure we only find pods with the `kubernetes_executor=True` label for extra safety.

Closes #28071
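
As a rough illustration of the selection described above, the sketch below builds the keyword arguments for a pod listing call. The two selector strings match the ones in this PR's diff; the helper name `build_adoption_query` is hypothetical and is not part of Airflow.

```python
from __future__ import annotations


def build_adoption_query(scheduler_job_id: str) -> dict[str, str]:
    """Build list-pods kwargs for pods left behind by a dead scheduler.

    Hypothetical helper for illustration only; the selector strings are
    the ones shown in this PR's diff.
    """
    return {
        # Field selector: skip pods whose phase is already Succeeded.
        "field_selector": "status.phase!=Succeeded",
        # Label selector: comma-separated requirements are ANDed, so a pod
        # must carry both the executor label and the scheduler's job id.
        "label_selector": (
            f"kubernetes_executor=True,airflow-worker={scheduler_job_id}"
        ),
    }


if __name__ == "__main__":
    print(build_adoption_query("123"))
```

With the official Kubernetes Python client, kwargs like these could presumably be passed to `CoreV1Api.list_namespaced_pod(namespace=..., **kwargs)`, which accepts both `field_selector` and `label_selector`.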


dstandish · 2023-01-12T20:51:44Z

airflow/executors/kubernetes_executor.py

        for scheduler_job_id in scheduler_job_ids:
            scheduler_job_id = pod_generator.make_safe_label_value(str(scheduler_job_id))
-            query_kwargs = {"label_selector": f"airflow-worker={scheduler_job_id}"}
+            # We will look for any pods owned by the no-longer-running scheduler,


do we know that scheduler_job_ids are all not running?

Yes, if you look up a few lines, we build that from the TIs that SchedulerJob asks us to try and adopt. And those TIs are tied to non-running SchedulerJobs.

dstandish · 2023-01-12T20:52:49Z

airflow/executors/kubernetes_executor.py

+            # still be in queued.
+            query_kwargs = {
+                "field_selector": "status.phase!=Succeeded",
+                "label_selector": f"kubernetes_executor=True,airflow-worker={scheduler_job_id}",


i'll bet you there's a selector IN but... don't care :)

I was going to say "actually, no", but turns out there is for labels but not for fields.

Future work!
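
To make the label-versus-field distinction concrete, here is a minimal sketch of how the two selector kinds could be composed. Kubernetes label selectors support set-based `in (...)` requirements, while field selectors only support `=`, `==`, and `!=`, so excluding several phases means chaining `!=` terms with commas. Both helper names are hypothetical, and nothing here is taken from the Airflow codebase.

```python
from __future__ import annotations


def label_in_selector(key: str, values: list[str]) -> str:
    """Set-based label requirement, e.g. 'airflow-worker in (1,2)'."""
    return f"{key} in ({','.join(values)})"


def exclude_phases(phases: list[str]) -> str:
    """Field selectors have no 'in'/'notin'; chain one '!=' term per phase."""
    return ",".join(f"status.phase!={phase}" for phase in phases)


if __name__ == "__main__":
    # One label selector covering several scheduler job ids at once.
    print(label_in_selector("airflow-worker", ["1", "2"]))
    # Excluding more than one pod phase with chained field selectors.
    print(exclude_phases(["Succeeded", "Failed"]))
```

Whether a single `in` query per adoption pass is preferable to one query per scheduler job id is left open here, as it is in the thread.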

* Be more selective when adopting pods with KubernetesExecutor

  When trying to adopt "resettable" TIs from SchedulerJob, we should not list out all the pods to compare against, only those that didn't succeed. This means we will get any pods that are still starting, running, or failed (meaning the TI wasn't moved to a terminal state there, and will be in our "adoptable" list).

  This avoids the scenario where a dead scheduler has both a completed, successful worker, and a still running worker, causing log lines like these about the successful one:

  ERROR - attempting to adopt taskinstance which was not specified by database: TaskInstanceKey(...)

  This also makes sure we only find pods with the `kubernetes_executor=True` label for extra safety.

  Closes #28071

* Also ignore done pods

(cherry picked from commit f64ac59)

jedcunningham added the type:bug-fix Changelog: Bug Fixes label Jan 12, 2023

jedcunningham added this to the Airflow 2.5.2 milestone Jan 12, 2023

jedcunningham requested a review from dstandish as a code owner January 12, 2023 20:15

boring-cyborg bot added provider:cncf-kubernetes Kubernetes (k8s) provider related issues area:Scheduler including HA (high availability) scheduler labels Jan 12, 2023

dstandish reviewed Jan 12, 2023


jedcunningham and others added 3 commits January 12, 2023 14:32

Also ignore done pods

c6c164d

Merge branch 'main' into better_logging_adoption

2252f1e

Merge branch 'main' into better_logging_adoption

d9a8b71

ephraimbuddy approved these changes Jan 18, 2023


ephraimbuddy merged commit f64ac59 into apache:main Jan 18, 2023

ephraimbuddy deleted the better_logging_adoption branch January 18, 2023 20:05

pierrejeambrun mentioned this pull request Mar 10, 2023

Status of testing of Apache Airflow 2.5.2rc2 #30028

Closed

64 tasks

ephraimbuddy merged 4 commits into apache:main from astronomer:better_logging_adoption