KubernetesPodOperator can leave running pods after cleanup if on_finish_action is set to "keep_pod" #49466

@sean-rose

Description

Apache Airflow version

2.10.5

If "Other Airflow 2 version" selected, which one?

No response

What happened?

When using the KubernetesPodOperator with on_finish_action set to "keep_pod", the pod is left running if the task fails while the pod is still running, e.g. because the task's execution_timeout is exceeded or the pod takes longer than startup_timeout_seconds to start.
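To make the behavior concrete, here is a simplified model of the cleanup decision as described in this issue. It is only a sketch: the function name deletes_pod is hypothetical, the phase strings are the standard Kubernetes pod phases, and the provider's real implementation may check more than this.

```python
def deletes_pod(on_finish_action: str, pod_phase: str) -> bool:
    """Simplified model of whether cleanup deletes the pod.

    Assumption: this mirrors the behavior described in the issue,
    not the provider's exact code.
    """
    if on_finish_action == "delete_pod":
        # Every pod is deleted, whatever state it is in.
        return True
    if on_finish_action == "delete_succeeded_pod":
        # Only pods that finished successfully are deleted.
        return pod_phase == "Succeeded"
    # "keep_pod": nothing is ever deleted, including pods still running.
    return False


# A pod that is still Pending or Running when the task fails survives
# cleanup under both "keep_pod" and "delete_succeeded_pod".
for action in ("keep_pod", "delete_succeeded_pod", "delete_pod"):
    for phase in ("Pending", "Running"):
        print(action, phase, "deleted" if deletes_pod(action, phase) else "left as-is")
```

In this model the pod phase is never consulted for "keep_pod", which is why a timed-out task leaves its pod running.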

What you think should happen instead?

IMO if a task fails its associated workload should always be stopped, not left running. Or failing that, it should at least be possible to easily configure the KubernetesPodOperator to clean up running pods without always deleting all pods.

If Kubernetes provided a way to stop running pods without entirely deleting them that'd be the ideal solution, but unfortunately that doesn't appear to be possible, so I see a few other options:

  1. Automatically delete running pods during cleanup.
    • This would be my preference, as I don't think it'd ever be desirable to leave pods running when the associated task has failed (e.g. due to the task execution timeout).
  2. Allow the behavior to be configured, either via on_finish_action or a new parameter specific to cleaning up running pods.
    • Though if we only add new on_finish_action options, people using the existing "keep_pod" and "delete_succeeded_pod" options would knowingly be left with the current buggy behavior, where running pods can remain after cleanup.
  3. Leave it up to users to implement their own pod cleanup logic via on_pod_cleanup callbacks.
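For reference, option 3 could look roughly like the sketch below. The class name DeleteActivePodOnCleanup is hypothetical, and a few things are assumed rather than confirmed here: that on_pod_cleanup receives pod and client as keyword arguments, that client is a synchronous Kubernetes CoreV1Api (deferrable mode would likely need an async variant), and that a real implementation would subclass the provider's KubernetesPodOperatorCallback.

```python
# Pod phases in which the workload has not finished yet.
ACTIVE_PHASES = frozenset({"Pending", "Running"})


class DeleteActivePodOnCleanup:
    """Hypothetical callback that deletes a pod still active at cleanup time.

    Assumption: in a real DAG this would subclass the provider's
    KubernetesPodOperatorCallback; it is a plain class here so the
    sketch runs without Airflow installed.
    """

    @staticmethod
    def on_pod_cleanup(*, pod, client, **kwargs) -> bool:
        """Delete the pod if it is Pending or Running; return True if deleted."""
        name = pod.metadata.name
        namespace = pod.metadata.namespace
        # Re-read the pod so the decision uses its current phase,
        # not a possibly stale copy held by the operator.
        current = client.read_namespaced_pod(name=name, namespace=namespace)
        if current.status.phase not in ACTIVE_PHASES:
            # Finished pods are kept, preserving the "keep_pod" intent.
            return False
        client.delete_namespaced_pod(name=name, namespace=namespace)
        return True
```

Assuming the operator's callbacks argument accepts it, this would be attached alongside on_finish_action="keep_pod", so finished pods are still kept for inspection while pods that are still active get deleted.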

I'm happy to submit a PR to implement any of these options, but would need guidance from the Kubernetes provider maintainers on what the preferred approach should be.

How to reproduce

Configure a KubernetesPodOperator with on_finish_action="keep_pod" and an execution_timeout shorter than its runtime:

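import datetime

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

KubernetesPodOperator(
    task_id="pod_task_timeout_test",
    name="pod-task-timeout-test",
    image="alpine",
    cmds=["/bin/sh"],
    # The container runs for 300 seconds...
    arguments=["-c", "sleep 300"],
    # ...but the task may only run for 10 seconds, so it times out while the pod is still running.
    execution_timeout=datetime.timedelta(seconds=10),
    on_finish_action="keep_pod",
    ...
)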

When the task fails due to the execution timeout the pod will be left running.

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==10.1.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

area:core, kind:bug, needs-triage, provider:cncf-kubernetes
