Skip to content

Be more selective when adopting pods with KubernetesExecutor#28899

Merged
ephraimbuddy merged 4 commits intoapache:mainfrom
astronomer:better_logging_adoption
Jan 18, 2023
Merged

Be more selective when adopting pods with KubernetesExecutor#28899
ephraimbuddy merged 4 commits intoapache:mainfrom
astronomer:better_logging_adoption

Conversation

@jedcunningham
Copy link
Member

When trying to adopt "resettable" TIs from SchedulerJob, we should not list out all the pods to compare against, only those that didn't succeed. This means we will get any pods that are still starting, running, or failed (meaning the TI wasn't moved to a terminal state there, and will be in out "adoptable" list).

This avoids the scenario where a dead scheduler has both a completed, successful worker, and a still running worker, causing log lines like these about the successful one:

ERROR - attempting to adopt taskinstance which was not specified by database: TaskInstanceKey(...)

This also makes sure we only find pods with the
kubernetes_executor=True label for extra safety.

Closes #28071

When trying to adopt "resettable" TIs from SchedulerJob, we should not
list out all the pods to compare against, only those that didn't
succeed. This means we will get any pods that are still starting,
running, or failed (meaning the TI wasn't moved to a terminal state
there, and will be in out "adoptable" list).

This avoids the scenario where a dead scheduler has both a completed,
successful worker, and a still running worker, causing log lines
like these about the successful one:

    ERROR - attempting to adopt taskinstance which was not specified by
    database: TaskInstanceKey(...)

This also makes sure we only find pods with the
`kubernetes_executor=True` label for extra safety.

Closes apache#28071
@jedcunningham jedcunningham added the type:bug-fix Changelog: Bug Fixes label Jan 12, 2023
@jedcunningham jedcunningham added this to the Airflow 2.5.2 milestone Jan 12, 2023
@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes (k8s) provider related issues area:Scheduler including HA (high availability) scheduler labels Jan 12, 2023
for scheduler_job_id in scheduler_job_ids:
scheduler_job_id = pod_generator.make_safe_label_value(str(scheduler_job_id))
query_kwargs = {"label_selector": f"airflow-worker={scheduler_job_id}"}
# We will look for any pods owned by the no-longer-running scheduler,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know that scheduler_job_ids are all not running?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if you look up a few lines, we build that from the TI's SchedulerJob asks us to try and adopt. And those are TIs are tied to non-running SchedulerJobs.

# still be in queued.
query_kwargs = {
"field_selector": "status.phase!=Succeeded",
"label_selector": f"kubernetes_executor=True,airflow-worker={scheduler_job_id}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll bet you there's a selector IN but... don't care :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to say "actually, no", but turns out there is for labels but not for fields.

Future work!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙃

@ephraimbuddy ephraimbuddy merged commit f64ac59 into apache:main Jan 18, 2023
@ephraimbuddy ephraimbuddy deleted the better_logging_adoption branch January 18, 2023 20:05
pierrejeambrun pushed a commit that referenced this pull request Mar 6, 2023
* Be more selective when adopting pods with KubernetesExecutor

When trying to adopt "resettable" TIs from SchedulerJob, we should not
list out all the pods to compare against, only those that didn't
succeed. This means we will get any pods that are still starting,
running, or failed (meaning the TI wasn't moved to a terminal state
there, and will be in out "adoptable" list).

This avoids the scenario where a dead scheduler has both a completed,
successful worker, and a still running worker, causing log lines
like these about the successful one:

    ERROR - attempting to adopt taskinstance which was not specified by
    database: TaskInstanceKey(...)

This also makes sure we only find pods with the
`kubernetes_executor=True` label for extra safety.

Closes #28071

* Also ignore done pods

(cherry picked from commit f64ac59)
pierrejeambrun pushed a commit that referenced this pull request Mar 7, 2023
* Be more selective when adopting pods with KubernetesExecutor

When trying to adopt "resettable" TIs from SchedulerJob, we should not
list out all the pods to compare against, only those that didn't
succeed. This means we will get any pods that are still starting,
running, or failed (meaning the TI wasn't moved to a terminal state
there, and will be in out "adoptable" list).

This avoids the scenario where a dead scheduler has both a completed,
successful worker, and a still running worker, causing log lines
like these about the successful one:

    ERROR - attempting to adopt taskinstance which was not specified by
    database: TaskInstanceKey(...)

This also makes sure we only find pods with the
`kubernetes_executor=True` label for extra safety.

Closes #28071

* Also ignore done pods

(cherry picked from commit f64ac59)
pierrejeambrun pushed a commit that referenced this pull request Mar 8, 2023
* Be more selective when adopting pods with KubernetesExecutor

When trying to adopt "resettable" TIs from SchedulerJob, we should not
list out all the pods to compare against, only those that didn't
succeed. This means we will get any pods that are still starting,
running, or failed (meaning the TI wasn't moved to a terminal state
there, and will be in out "adoptable" list).

This avoids the scenario where a dead scheduler has both a completed,
successful worker, and a still running worker, causing log lines
like these about the successful one:

    ERROR - attempting to adopt taskinstance which was not specified by
    database: TaskInstanceKey(...)

This also makes sure we only find pods with the
`kubernetes_executor=True` label for extra safety.

Closes #28071

* Also ignore done pods

(cherry picked from commit f64ac59)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler provider:cncf-kubernetes Kubernetes (k8s) provider related issues type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Kubernetes logging errors - attempting to adopt taskinstance which was not specified by database

3 participants