Allow for retry when tasks are stuck in queued #43520
Merged
+507
−94
Conversation
dstandish reviewed Oct 30, 2024
o-nikolas reviewed Oct 30, 2024
dstandish reviewed Oct 30, 2024
dstandish reviewed Oct 30, 2024
…ed_timeout`. Tasks can get stuck in queued for a wide variety of reasons (e.g. celery loses track of a task, a cluster can't further scale up its workers, etc.), but tasks should not be stuck in queued for a long time. Originally, we simply marked a task as failed when it was stuck in queued for too long. We found that this led to suboptimal outcomes, as ideally we would like "failed" to mean that a task was unable to run, rather than that we were unable to run the task. As a compromise between always failing a stuck task and always rescheduling a stuck task (which could lead to tasks being stuck in queued forever without informing the user), we have created the config `AIRFLOW__CORE__NUM_STUCK_RETRIES`. With this new configuration, an airflow admin can decide how sensitive they would like their airflow to be WRT failing stuck tasks.
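The retry-versus-fail decision described above can be sketched as a small, hypothetical function (this is illustrative, not the actual Airflow implementation; `NUM_STUCK_RETRIES` here stands in for the value read from `AIRFLOW__CORE__NUM_STUCK_RETRIES`):

```python
# Hypothetical sketch: retry a stuck-in-queued task a limited number of
# times before giving up and failing it, per the compromise described above.
NUM_STUCK_RETRIES = 2  # stands in for AIRFLOW__CORE__NUM_STUCK_RETRIES

def next_action(times_stuck_so_far: int) -> str:
    """Decide what to do with a task instance found stuck in queued."""
    if times_stuck_so_far < NUM_STUCK_RETRIES:
        return "reschedule"  # revoke from executor, set back to scheduled
    return "fail"  # retries exhausted: surface the problem to the user

print(next_action(0))
print(next_action(2))
```

A higher threshold makes the scheduler more tolerant of transient executor hiccups; a lower one surfaces infrastructure problems to the user sooner.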
1f8b642 to 8eb60b1
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
…s since they will not have the scheduler change
…flow into handle-stuck-in-queue
…flow into handle-stuck-in-queue
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
…flow into handle-stuck-in-queue
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
jscheffl reviewed Nov 16, 2024
jscheffl pushed a commit to jscheffl/airflow that referenced this pull request Nov 18, 2024
The old "stuck in queued" logic just failed the tasks. Now we requeue them. We accomplish this by revoking the task from the executor and setting its state to scheduled. We'll re-queue it up to 2 times; the number of times is configurable by a hidden config.

We added a method revoke_task to the base executor because it's a discrete operation that is required for this feature, and it might be useful in other cases, e.g. when detecting zombies.

We set state to failed or scheduled directly from the scheduler (rather than sending through the event buffer) because the event buffer makes more sense for handling external events -- why round trip through the executor and back to the scheduler when the scheduler is initiating the action? This also avoids having to deal with "state mismatch" issues when processing events.

---------

(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
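The flow in the commit message above can be sketched as follows. This is a simplified, hypothetical model: names like `FakeExecutor` and `handle_stuck_in_queued` mirror the description but are not the actual Airflow internals, and the task instance is modeled as a plain dict.

```python
# Simplified, hypothetical sketch of the requeue flow described above.
MAX_REQUEUE_ATTEMPTS = 2  # configurable via a hidden config in the real PR

class FakeExecutor:
    """Stands in for a BaseExecutor subclass with the new revoke_task."""
    def __init__(self):
        self.revoked = []

    def revoke_task(self, ti):
        # Discrete operation: make the executor forget the queued task.
        self.revoked.append(ti["id"])

def handle_stuck_in_queued(ti, executor):
    """Requeue a stuck task, or fail it after too many attempts."""
    executor.revoke_task(ti)
    if ti["stuck_count"] < MAX_REQUEUE_ATTEMPTS:
        ti["stuck_count"] += 1
        # The scheduler sets the state directly, no event-buffer round trip.
        ti["state"] = "scheduled"
    else:
        ti["state"] = "failed"

ti = {"id": "t1", "state": "queued", "stuck_count": 0}
executor = FakeExecutor()
handle_stuck_in_queued(ti, executor)
print(ti["state"])  # scheduled on the first stuck detection
```

Setting the state directly from the scheduler, rather than round-tripping through the executor's event buffer, is what avoids the "state mismatch" issues the commit message mentions.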
kandharvishnu pushed a commit to kandharvishnu/airflow that referenced this pull request Nov 19, 2024
jscheffl added a commit that referenced this pull request Nov 19, 2024
…44158)

* [v2-10-test] Re-queue tasks when they are stuck in queued (#43520)

The old "stuck in queued" logic just failed the tasks. Now we requeue them. We accomplish this by revoking the task from the executor and setting its state to scheduled. We'll re-queue it up to 2 times; the number of times is configurable by a hidden config.

We added a method revoke_task to the base executor because it's a discrete operation that is required for this feature, and it might be useful in other cases, e.g. when detecting zombies.

We set state to failed or scheduled directly from the scheduler (rather than sending through the event buffer) because the event buffer makes more sense for handling external events -- why round trip through the executor and back to the scheduler when the scheduler is initiating the action? This also avoids having to deal with "state mismatch" issues when processing events.

(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* fix test_handle_stuck_queued_tasks_multiple_attempts (#44093)

---------

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: GPK <gopidesupavan@gmail.com>
dstandish added a commit that referenced this pull request Nov 19, 2024
This is a fix-up / follow-up to #43520. It does not really make a material difference; I'm just avoiding use of the session decorator, and the create/dispose session logic, when they are not needed. I also commit as I go along, since there's no reason to handle multiple distinct TIs in the same transaction.
utkarsharma2 pushed a commit that referenced this pull request Dec 4, 2024
utkarsharma2 pushed a commit that referenced this pull request Dec 9, 2024
ashb pushed a commit that referenced this pull request Aug 12, 2025
In issue #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520. In that PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them.

Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced in the PR by the task instance's set_state method: ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead in Airflow 3.
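The bug and fix described above can be sketched with toy classes. This is hypothetical and simplified: in real Airflow, the executor's fail() path triggers the failure-handling machinery (including callbacks), while a bare set_state() only updates the recorded state.

```python
# Hypothetical sketch of the callback bug and its fix described above.
class TaskInstance:
    def __init__(self):
        self.state = "queued"
        self.callbacks_ran = []

    def set_state(self, state):
        self.state = state  # no callback fires here: the reported bug

class Executor:
    def fail(self, ti):
        ti.set_state("failed")
        ti.callbacks_ran.append("on_failure")  # callback runs on this path

buggy = TaskInstance()
buggy.set_state("failed")   # old code path: state changes, no callback
fixed = TaskInstance()
Executor().fail(fixed)      # fixed code path: callback runs
print(buggy.callbacks_ran, fixed.callbacks_ran)
```

The point of the fix is that both paths end with the task in the failed state, but only the executor path notifies the user's failure callbacks.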
ashb pushed a commit that referenced this pull request Aug 12, 2025
…#53435)

In issue #51301, it was reported that failure callbacks do not run for task instances that get stuck in queued and fail in Airflow 2.10.5. This is happening due to the changes introduced in PR #43520. In this PR, logic was introduced to requeue tasks that get stuck in queued (up to two times by default) before failing them. Previously, the executor's fail method was called when the task needed to be failed after max requeue attempts. This was replaced by the task instance's set_state method in the PR: ti.set_state(TaskInstanceState.FAILED, session=session). Without the executor's fail method being called, failure callbacks will not be executed for such task instances. Therefore, I changed the code to call the executor's fail method instead in Airflow 3.

(cherry picked from commit 6da77b1)

Co-authored-by: Karen Braganza <karenbraganza15@gmail.com>
ashb pushed a commit that referenced this pull request Aug 12, 2025
ashb pushed a commit that referenced this pull request Aug 13, 2025
kaxil pushed a commit that referenced this pull request Aug 13, 2025
Tasks can get stuck in queued for a wide variety of reasons (e.g. celery loses
track of a task, a cluster can't further scale up its workers, etc.), but tasks
should not be stuck in queued for a long time.

Originally, we simply marked a task as failed when it was stuck in queued for
too long. We found that this led to suboptimal outcomes, as ideally we would like "failed"
to mean that a task was unable to run, rather than that we were unable to run the task.

As a compromise between always failing a stuck task and always rescheduling a stuck task (which could
lead to tasks being stuck in queued forever without informing the user), we have created the config
`[core] num_stuck_reschedules`. With this new configuration, an airflow admin can decide how sensitive they would like their airflow to be WRT failing stuck tasks.
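A minimal example of how an admin might set this in airflow.cfg, assuming the option lives in the `[core]` section as the description above states (the exact option name in the released version may differ, and the PR notes it is a hidden config):

```ini
[core]
# Illustrative value: how many times to reschedule a task found stuck in
# queued before failing it. Higher values tolerate transient executor
# problems; lower values surface stuck tasks to the user sooner.
num_stuck_reschedules = 3
```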
Here is an example of what it looks like after trying this out with the celery executor: