
Task is retried when it is set to failed manually #36471

@antoniocorralsierra

Description


Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.7.2

What happened?

I have a DAG with several tasks using KubernetesPodOperator that run in parallel. The tasks are configured with 5 retries.

When I mark the DagRun as failed from the UI (calling /dagrun_failed) or via the API endpoint (https://airflow.apache.org/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}) while the tasks are running, they are marked as failed, but afterwards their status changes to up_for_retry.
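The state flip described above (failed, then up_for_retry) is what an ordinary retry-eligibility check would produce if the "externally set to failed" signal were lost along the way. A minimal sketch of that decision logic, with hypothetical names (this is not Airflow's actual implementation):

```python
# Hypothetical sketch of a post-failure state decision, mirroring the
# behaviour described in this report. Names are illustrative only.

def next_state(try_number: int, max_retries: int, externally_failed: bool) -> str:
    """Decide the state a task should end up in after a failure."""
    if externally_failed:
        # A manually failed task should never be retried.
        return "failed"
    if try_number <= max_retries:
        return "up_for_retry"
    return "failed"

# The bug symptom: if the external-failure signal is lost (e.g. masked by
# a secondary exception during pod cleanup), the failure looks ordinary
# and the task is rescheduled.
print(next_state(try_number=1, max_retries=5, externally_failed=False))  # up_for_retry
print(next_state(try_number=1, max_retries=5, externally_failed=True))   # failed
```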

The task log is the following:
[2023-12-28, 10:21:43 UTC] {pod_manager.py:351} WARNING - Pod not yet started: clientes-wog3wa6y
[2023-12-28, 10:21:43 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: 10002860.Extraction.clientes manual__2023-12-27T11:17:15.812101+00:00 [failed]> from DB
[2023-12-28, 10:21:43 UTC] {local_task_job_runner.py:294} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2023-12-28, 10:21:43 UTC] {job.py:216} DEBUG - [heartbeat]
[2023-12-28, 10:21:43 UTC] {process_utils.py:131} INFO - Sending 15 to group 632. PIDs of all processes in the group: [632]
[2023-12-28, 10:21:43 UTC] {process_utils.py:86} INFO - Sending the signal 15 to group 632
[2023-12-28, 10:21:43 UTC] {taskinstance.py:1632} ERROR - Received SIGTERM. Terminating subprocesses.
[2023-12-28, 10:21:46 UTC] {taskinstance.py:1937} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 597, in execute_sync
    self.await_pod_start(pod=self.pod)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 358, in await_pod_start
    time.sleep(1)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 1634, in signal_handler
    raise AirflowException("Task received SIGTERM signal")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 714, in cleanup
    istio_enabled = self.is_istio_enabled(remote_pod)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found

The problem happens when the task is marked as failed while the KubernetesPodOperator is waiting for the pod to reach a phase other than Pending.

The same behaviour is seen when a single task using KubernetesPodOperator is marked as failed while it is running but its pod is still in Pending status.
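The traceback above shows a second exception (ApiException 404) being raised during cleanup while the original SIGTERM exception is still propagating, so the caller no longer sees the external kill. A self-contained sketch of that exception-masking pattern (stub functions stand in for the operator methods; this is an illustration, not Airflow code):

```python
# Minimal reproduction of the exception-masking pattern in the traceback:
# an error raised during cleanup replaces the original SIGTERM exception.

class AirflowException(Exception):
    pass

class ApiException(Exception):
    pass

def await_pod_start():
    # Stands in for pod_manager.await_pod_start(): the external
    # "set to failed" arrives as SIGTERM while the pod is still Pending.
    raise AirflowException("Task received SIGTERM signal")

def cleanup():
    # Stands in for pod cleanup: the pod is already gone, so the
    # Kubernetes API answers 404.
    raise ApiException("(404) Reason: Not Found")

def execute_sync():
    try:
        await_pod_start()
    finally:
        cleanup()  # raises ApiException, masking the AirflowException

caught = None
try:
    execute_sync()
except Exception as exc:
    caught = exc

print(type(caught).__name__)               # ApiException
print(type(caught.__context__).__name__)   # AirflowException (the masked one)
```

The original AirflowException survives only as `__context__` of the ApiException, which is consistent with the failure no longer being treated as an external, non-retryable kill.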

PS: I use CeleryKubernetesExecutor and these tasks run on the CeleryExecutor.

What you think should happen instead?

When I mark a DagRun or a task as failed manually, it should fail without being retried.

How to reproduce

There are two ways:

  1. Setting the failed state manually from the UI on a task that uses KubernetesPodOperator and is running while its pod is still in Pending status.
  2. Setting the DagRun as failed from the UI (calling /dagrun_failed) or via the API endpoint (https://airflow.apache.org/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}).
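For step 2, the API call can be sketched with the standard library only. The host, credentials, and IDs below are placeholders, and the request is constructed but deliberately not sent:

```python
# Sketch of the "mark DagRun as failed" REST call from step 2.
# Host and IDs are placeholders; authentication is omitted.
import json
import urllib.request

dag_id = "example_dag"                                    # placeholder
dag_run_id = "manual__2023-12-27T11:17:15.812101+00:00"   # placeholder

url = f"http://localhost:8080/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}"
body = json.dumps({"state": "failed"}).encode()

req = urllib.request.Request(
    url,
    data=body,
    method="PATCH",
    headers={"Content-Type": "application/json"},
)
print(req.get_method())  # PATCH
# urllib.request.urlopen(req) would send it against a live webserver.
```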

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-celery 3.3.4
apache-airflow-providers-cncf-kubernetes 7.6.0
apache-airflow-providers-redis 3.3.2

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
