Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.7.2
What happened?
I have a DAG where several tasks use KubernetesPodOperator and run in parallel. The tasks are configured with 5 retries.
When I mark the DagRun as failed from the UI (calling /dagrun_failed) or via the API endpoint (https://airflow.apache.org/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}) while the tasks are running, they are marked as failed, but after that their status changes to up_for_retry.
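For reference, this is roughly how I set the state via the stable REST API; the base URL, credentials, dag_id and dag_run_id below are placeholders:

```python
import requests

# PATCH the DagRun state via the stable REST API (same effect as the
# "Mark Failed" action in the UI). URL, auth and IDs are placeholders.
AIRFLOW_URL = "http://localhost:8080"
resp = requests.patch(
    f"{AIRFLOW_URL}/api/v1/dags/my_dag/dagRuns/manual__2023-12-27T11:17:15.812101+00:00",
    json={"state": "failed"},
    auth=("admin", "admin"),  # basic auth for illustration; adjust to your setup
)
resp.raise_for_status()
print(resp.json()["state"])  # prints "failed"
```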
The task log is the following:
[2023-12-28, 10:21:43 UTC] {pod_manager.py:351} WARNING - Pod not yet started: clientes-wog3wa6y
[2023-12-28, 10:21:43 UTC] {taskinstance.py:844} DEBUG - Refreshing TaskInstance <TaskInstance: 10002860.Extraction.clientes manual__2023-12-27T11:17:15.812101+00:00 [failed]> from DB
[2023-12-28, 10:21:43 UTC] {local_task_job_runner.py:294} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2023-12-28, 10:21:43 UTC] {job.py:216} DEBUG - [heartbeat]
[2023-12-28, 10:21:43 UTC] {process_utils.py:131} INFO - Sending 15 to group 632. PIDs of all processes in the group: [632]
[2023-12-28, 10:21:43 UTC] {process_utils.py:86} INFO - Sending the signal 15 to group 632
[2023-12-28, 10:21:43 UTC] {taskinstance.py:1632} ERROR - Received SIGTERM. Terminating subprocesses.
[2023-12-28, 10:21:46 UTC] {taskinstance.py:1937} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 597, in execute_sync
    self.await_pod_start(pod=self.pod)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 358, in await_pod_start
    time.sleep(1)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 1634, in signal_handler
    raise AirflowException("Task received SIGTERM signal")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 714, in cleanup
    istio_enabled = self.is_istio_enabled(remote_pod)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
The problem happens when the task is marked as failed while the KubernetesPodOperator is waiting for the pod to reach a phase other than Pending.
The same behaviour is seen when a single task using KubernetesPodOperator is marked as failed while it is running but its pod is still in the Pending phase.
P.S.: I use the CeleryKubernetesExecutor, and these tasks run on the CeleryExecutor.
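To illustrate the mechanism visible in the log above: the job runner's heartbeat notices that the state was set externally and SIGTERMs the task process, and Airflow's signal handler turns that into an AirflowException inside the await_pod_start sleep. A rough, standalone sketch of that pattern (not Airflow's actual code, just the flow shown by the traceback):

```python
import os
import signal
import threading
import time


class AirflowException(Exception):
    """Stand-in for airflow.exceptions.AirflowException."""


def signal_handler(signum, frame):
    # Mirrors the handler in taskinstance.py seen in the traceback:
    # SIGTERM is converted into an exception inside the running task code.
    raise AirflowException("Task received SIGTERM signal")


signal.signal(signal.SIGTERM, signal_handler)

# Stand-in for the external "mark failed": once the heartbeat sees the
# externally set state, the job runner sends SIGTERM to the task process.
threading.Timer(0.5, os.kill, args=(os.getpid(), signal.SIGTERM)).start()

try:
    # Stand-in for pod_manager.await_pod_start polling a Pending pod.
    while True:
        time.sleep(1)
except AirflowException as exc:
    print(f"Terminating instance: {exc}")
```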
What you think should happen instead?
When I manually mark a DagRun or a task as failed, it should fail without retrying.
How to reproduce
There are two ways (a minimal reproduction DAG is sketched after this list):
- Manually set the failed state from the UI on a running task that uses KubernetesPodOperator while the task's pod is still in the Pending phase.
- Set the DagRun as failed from the UI (calling /dagrun_failed) or via the API endpoint (https://airflow.apache.org/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}) while such a task is running.
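A minimal DAG along these lines reproduces it for me; the dag_id, namespace, image and the oversized resource request are placeholders, chosen only to keep the pods in the Pending phase long enough to mark the run as failed:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="kpo_mark_failed_repro",  # hypothetical dag_id
    start_date=datetime(2023, 12, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Parallel KubernetesPodOperator tasks, each configured with 5 retries,
    # matching the setup described above.
    for i in range(3):
        KubernetesPodOperator(
            task_id=f"extraction_{i}",
            name=f"clientes-{i}",
            namespace="default",   # placeholder namespace
            image="busybox:1.36",  # placeholder image
            cmds=["sleep", "300"],
            retries=5,
            container_resources=k8s.V1ResourceRequirements(
                requests={"cpu": "64"}  # oversized request keeps the pod Pending
            ),
        )
```

While the pods are Pending, mark the DagRun as failed from the UI or with the PATCH call shown earlier; the tasks go to failed and then flip to up_for_retry.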
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-celery 3.3.4
apache-airflow-providers-cncf-kubernetes 7.6.0
apache-airflow-providers-redis 3.3.2
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct