Description
Apache Airflow version
2.10.5
If "Other Airflow 2 version" selected, which one?
No response
What happened?
When using the KubernetesPodOperator with on_finish_action set to "keep_pod", if the task fails in a way where the pod is still running at the time (e.g. the task's specified execution_timeout is exceeded, or the pod takes longer than startup_timeout_seconds to start) then the pod is left running.
What you think should happen instead?
IMO if a task fails its associated workload should always be stopped, not left running. Or failing that, it should at least be possible to easily configure the KubernetesPodOperator to clean up running pods without always deleting all pods.
If Kubernetes provided a way to stop running pods without entirely deleting them that'd be the ideal solution, but unfortunately that doesn't appear to be possible, so I see a few other options:
- Automatically delete running pods during cleanup.
  - This would be my preference, as I don't think it'd ever be desirable to leave pods running when the associated task has failed (e.g. due to the task execution timeout).
- Allow the behavior to be configured, either via on_finish_action or a new parameter specific to cleaning up running pods.
  - Though if we just add other options for on_finish_action we'd be intentionally leaving people using the existing "keep_pod" and "delete_succeeded_pod" options in the existing buggy state where running pods can be left after cleanup.
- Leave it up to users to implement their own pod cleanup logic via on_pod_cleanup callbacks.
  - This isn't currently feasible since on_pod_cleanup callbacks aren't called for failed tasks, but I've submitted "Have KubernetesPodOperator call on_pod_cleanup callbacks for failed tasks" (#49441) to change that behavior.
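To make the third option more concrete, here is a rough sketch of what user-side cleanup logic could look like. The function name `delete_pod_if_not_finished` and the `FakeCoreV1Api` stand-in are hypothetical and exist only for this illustration. The sketch assumes the callback receives the pod and a Kubernetes client as keyword arguments and that the client exposes `delete_namespaced_pod(name=..., namespace=...)`, as the Kubernetes Python client's `CoreV1Api` does. How this would be wired into an on_pod_cleanup callback is not shown and would depend on #49441.

```python
from types import SimpleNamespace

# Kubernetes pod phases in which no container is running any more.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def delete_pod_if_not_finished(*, pod, client, **kwargs) -> bool:
    """Hypothetical cleanup helper: delete the pod unless it already finished.

    Returns True if a delete request was issued, False otherwise.
    """
    if pod.status.phase in TERMINAL_PHASES:
        return False  # Finished pods are left alone so they can be inspected.
    client.delete_namespaced_pod(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
    )
    return True


class FakeCoreV1Api:
    """Stand-in for a Kubernetes client that only records delete calls."""

    def __init__(self):
        self.deleted = []

    def delete_namespaced_pod(self, name, namespace):
        self.deleted.append((namespace, name))


# Simulate the pod from the reproduction steps, still running after the timeout.
running_pod = SimpleNamespace(
    metadata=SimpleNamespace(name="pod-task-timeout-test", namespace="default"),
    status=SimpleNamespace(phase="Running"),
)
fake = FakeCoreV1Api()
delete_pod_if_not_finished(pod=running_pod, client=fake)
print(fake.deleted)  # [('default', 'pod-task-timeout-test')]
```

Pods in a terminal phase are deliberately skipped, so "keep_pod" would still preserve finished pods for debugging.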
I'm happy to submit a PR to implement any of these options, but would need guidance from the Kubernetes provider maintainers on what the preferred approach should be.
How to reproduce
Configure a KubernetesPodOperator with on_finish_action="keep_pod" and an execution_timeout shorter than its runtime:
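    import datetime

    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    KubernetesPodOperator(
        task_id='pod_task_timeout_test',
        name='pod-task-timeout-test',
        image='alpine',
        cmds=['/bin/sh'],
        arguments=['-c', 'sleep 300'],
        # The container sleeps for 300 seconds, but the task times out after 10.
        execution_timeout=datetime.timedelta(seconds=10),
        on_finish_action="keep_pod",
        ...
    )

When the task fails due to the execution timeout, the pod will be left running.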
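The difference between the behavior reported here and the first proposed option can be sketched as a small decision rule. This is only an approximation for discussion, not the provider's actual code: the function names `current_should_delete` and `proposed_should_delete` are hypothetical, and the "current" rule is inferred from the behavior described in this report.

```python
# Kubernetes pod phases in which no container is running any more.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def current_should_delete(on_finish_action: str, phase: str) -> bool:
    """Approximation of the reported behavior: only the action and success matter."""
    if on_finish_action == "delete_pod":
        return True
    if on_finish_action == "delete_succeeded_pod":
        return phase == "Succeeded"
    return False  # "keep_pod" never deletes, even if the pod is still running.


def proposed_should_delete(on_finish_action: str, phase: str) -> bool:
    """Sketch of the first option: pods that have not finished are always deleted."""
    if phase not in TERMINAL_PHASES:
        return True  # Pending, Running, or Unknown: the workload may still be active.
    return current_should_delete(on_finish_action, phase)


# The reproduction case: the task timed out while the pod was still running.
print(current_should_delete("keep_pod", "Running"))   # False
print(proposed_should_delete("keep_pod", "Running"))  # True
```

Under this rule, "keep_pod" and "delete_succeeded_pod" behave as before for pods that have finished, and only pods that are still active at cleanup time are treated differently.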
Operating System
Debian GNU/Linux 12 (bookworm)
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==10.1.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct