Skip to content

Conversation

@pankajastro
Copy link
Member

@pankajastro pankajastro commented May 13, 2023

In the case of multiple schedulers lots of tasks running
If schedulers restart a bit quickly some cases it sets the wrong task status.
In this PR, I'm changing some checks so that if the pod status is non-terminal
then set the task status to Failed only if the deletion_timestamp (metadata.deletion_timestamp) is set for the pod. This gets set when server receive the deletion request https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/object-meta/#ObjectMeta

Example screenshot before changes
Screenshot 2023-05-14 at 2 55 02 AM

Example screenshot after changes
Screenshot 2023-05-14 at 2 49 30 AM


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes (k8s) provider related issues area:Scheduler including HA (high availability) scheduler labels May 13, 2023
@pankajastro pankajastro force-pushed the fix_kube_job_watcher branch 2 times, most recently from 3e022c4 to 03dc333 Compare May 14, 2023 12:29
@pankajastro pankajastro marked this pull request as ready for review May 14, 2023 13:28
@pankajastro pankajastro force-pushed the fix_kube_job_watcher branch from 33807d8 to b2d280f Compare May 17, 2023 09:44
@eladkal eladkal added this to the Airflow 2.6.2 milestone May 18, 2023
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label May 18, 2023
Copy link
Contributor

@ephraimbuddy ephraimbuddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. You can test this on a deployment. Please revert when tested as I would like this to be part of 2.6.2 which we will cut tomorrow

@pankajastro pankajastro force-pushed the fix_kube_job_watcher branch 2 times, most recently from 7c92d79 to 8006ae9 Compare June 7, 2023 09:30
@eladkal eladkal modified the milestones: Airflow 2.6.2, Airlfow 2.6.3 Jun 8, 2023
In the case of multiple schedulers and lots of tasks running
If somehow schedulers restart and try to adopt pods
in some cases, it sets the wrong task status.
In this PR, I'm changing some checks so that if the pod status is non-terminal
then set the task status Failed only
if the pod event type is DELETED and POD_EXECUTOR_DONE_KEY is in the pod label
@pankajastro pankajastro force-pushed the fix_kube_job_watcher branch from 2a519d2 to c8fe9e0 Compare June 8, 2023 08:02
@pankajastro
Copy link
Member Author

LGTM. You can test this on a deployment. Please revert when tested as I would like this to be part of 2.6.2 which we will cut tomorrow

Tested it

@ephraimbuddy ephraimbuddy merged commit dfbf529 into apache:main Jun 8, 2023
@ephraimbuddy ephraimbuddy deleted the fix_kube_job_watcher branch June 8, 2023 12:55
@eladkal eladkal modified the milestones: Airlfow 2.6.3, Airflow 2.6.2 Jun 8, 2023
eladkal pushed a commit that referenced this pull request Jun 8, 2023
* Fix Kubernetes executor set wrong task status

In the case of multiple schedulers and lots of tasks running
If somehow schedulers restart and try to adopt pods
in some cases, it sets the wrong task status.
In this PR, I'm changing some checks so that if the pod status is non-terminal
then set the task status Failed only
if the pod event type is DELETED and POD_EXECUTOR_DONE_KEY is in the pod label

* cleanup

(cherry picked from commit dfbf529)
potiuk pushed a commit that referenced this pull request Jun 9, 2023
* Fix Kubernetes executor set wrong task status

In the case of multiple schedulers and lots of tasks running
If somehow schedulers restart and try to adopt pods
in some cases, it sets the wrong task status.
In this PR, I'm changing some checks so that if the pod status is non-terminal
then set the task status Failed only
if the pod event type is DELETED and POD_EXECUTOR_DONE_KEY is in the pod label

* cleanup

(cherry picked from commit dfbf529)
@dirrao
Copy link
Contributor

dirrao commented Jul 23, 2023

#31198 issue fixed in this MR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler provider:cncf-kubernetes Kubernetes (k8s) provider related issues type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants