
Conversation

@kaxil
Member

@kaxil kaxil commented Aug 29, 2025

closes #55045

Clearing a DAG with running tasks causes them to get stuck in RESTARTING state indefinitely, with continuous heartbeat timeout events. This is a regression in Airflow 3.x where the scheduler fails to process executor completion events for RESTARTING tasks. This change also preserves logs.

Error Symptoms

  • Tasks remain in RESTARTING state after clearing
  • Continuous heartbeat timeout events: 404 Not Found or 409 Conflict
  • Tasks never transition to retry or completion
  • Executor reports SUCCESS but task state doesn't update

Root Cause

The scheduler's _process_executor_events() function has two gatekeeping mechanisms that filter which executor events get processed:

  1. First filter (lines 769-775): Only processes events for certain states
  2. Second filter (lines 859-864): Validates task instance state matches executor event

Both filters were missing TaskInstanceState.RESTARTING, causing the scheduler to ignore when tasks in RESTARTING state were successfully terminated by the executor.
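
For illustration, here is a minimal, self-contained sketch of what the fix amounts to. It is not the actual `_process_executor_events()` code; the enum and the gate function below are stand-ins (the real checks live in the scheduler and use Airflow's `TaskInstanceState`), but it shows the shape of the change: RESTARTING must be accepted by both gates so the executor event is not silently dropped.

```python
# Illustrative sketch only -- not the actual scheduler code. The enum below is a
# local stand-in for Airflow's TaskInstanceState so the example runs on its own.
from enum import Enum


class TIState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    RESTARTING = "restarting"
    SUCCESS = "success"
    FAILED = "failed"


# Gate 1: which task-instance states are eligible for executor-event processing.
# Before the fix, RESTARTING was missing here, so events for cleared running
# tasks were dropped.
EVENT_ELIGIBLE_STATES = {TIState.QUEUED, TIState.RUNNING, TIState.RESTARTING}


def passes_event_gates(ti_state: TIState, executor_state: TIState) -> bool:
    """Sketch of both gates: the TI must be in an eligible state (gate 1) and the
    TI state / executor-reported state combination must be one we act on (gate 2)."""
    if ti_state not in EVENT_ELIGIBLE_STATES:  # gate 1
        return False
    # Gate 2 (simplified): a SUCCESS report for a RESTARTING task now counts as
    # "old attempt terminated" rather than being ignored as a state mismatch.
    return executor_state in (TIState.SUCCESS, TIState.FAILED)


if __name__ == "__main__":
    # A task cleared while running: the executor later confirms the old process is gone.
    assert passes_event_gates(TIState.RESTARTING, TIState.SUCCESS)
```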

Historical Context

The RESTARTING state was introduced in Airflow 2.2.0 (PR #16681) to handle cleared running tasks gracefully:

  • It prevents cleared tasks from transitioning to FAILED (as the old SHUTDOWN state did)
  • It allows tasks to be retried instead of being marked as failed
  • In Airflow 2.x, it originally relied on the "zombie detection" mechanism

In Airflow 3.x, the zombie detection mechanism was removed with the transition from Job-based to Task SDK architecture, but the scheduler was never updated to handle RESTARTING tasks in executor events.

Why not immediately set state to None

The RESTARTING state serves a critical purpose in the clearing workflow:

  1. Race condition prevention: When clear_task_instances() is called, it immediately:

    • Generates a new ti.id via prepare_db_for_next_try()
    • Sets state to RESTARTING
    • Commits it to DB
  2. Running task process continues: The task process keeps running with the old ti.id and attempts heartbeating, which fails with 404 Not Found since the database now has a new ti.id.

  3. Executor termination: The executor eventually kills the running process and reports SUCCESS for the terminated task.

  4. Scheduler coordination: The RESTARTING state signals to the scheduler that this task was intentionally cleared and should be retried (not failed) when the executor confirms termination.

Setting the state directly to None during clearing would lose this coordination mechanism and could lead to race conditions where the scheduler processes a "completed" task before the executor has terminated the running process.
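
To make this coordination concrete, here is a small hedged sketch of the scheduler-side handling once the executor confirms termination. It uses simplified stand-in types rather than Airflow's TaskInstance model, and the exact `max_tries` adjustment shown is just one plausible reading of the "adjusted max_tries" mentioned in the change description: the cleared task is handed back to scheduling (state None) instead of being failed.

```python
# Illustrative sketch with simplified stand-ins; not the real scheduler code.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TIState(str, Enum):
    RESTARTING = "restarting"
    SUCCESS = "success"


@dataclass
class TaskInstance:
    state: Optional[TIState]
    try_number: int
    max_tries: int


def on_executor_completion(ti: TaskInstance, executor_state: TIState) -> None:
    """When the executor reports that the old attempt of a cleared (RESTARTING)
    task has terminated, re-queue the task instead of failing it."""
    if ti.state is TIState.RESTARTING and executor_state is TIState.SUCCESS:
        # Don't let the killed attempt consume a retry: keep max_tries ahead of
        # try_number (one plausible version of the "adjusted max_tries" above).
        ti.max_tries = max(ti.max_tries, ti.try_number)
        ti.state = None  # None makes the task eligible for scheduling again
        return
    # ... normal success/failure handling for other states would go here ...


if __name__ == "__main__":
    ti = TaskInstance(state=TIState.RESTARTING, try_number=1, max_tries=1)
    on_executor_completion(ti, TIState.SUCCESS)
    assert ti.state is None and ti.max_tries >= ti.try_number
```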

Future Considerations

This fix resolves the immediate regression but raises design questions for future improvements:

  1. Callbacks: Should on_retry_callback be triggered for cleared tasks?
  2. Email notifications: Should email_on_retry be sent?

These questions are deferred to maintain scope and can be addressed in follow-up discussions if needed.

Logs are preserved with this change:
(screenshots: task logs preserved after clearing)

… state

Clearing a DAG with running tasks causes them to get stuck in RESTARTING state
indefinitely with continuous heartbeat timeout events. This was a regression in
Airflow 3.x where the scheduler failed to process executor completion events for
RESTARTING tasks.

The scheduler's _process_executor_events() function has two gatekeeping mechanisms
that filter executor events, both of which were missing TaskInstanceState.RESTARTING.
This caused the scheduler to ignore when tasks in RESTARTING state were successfully
terminated by the executor.

Changes:
- Add RESTARTING to executor event processing filters in scheduler
- Add special handling for successful termination of RESTARTING tasks
- Set RESTARTING tasks to None (scheduled) state with adjusted max_tries
- Update task lifecycle diagram to reflect RESTARTING → None transition
- Add regression test for max_tries edge case with proper mock spec

This ensures cleared running tasks can retry immediately without getting stuck,
consistent with clearing behavior for non-running tasks.

Fixes apache#55045
@kaxil kaxil requested review from XD-DENG and ashb as code owners August 29, 2025 22:08
@boring-cyborg boring-cyborg bot added area:Scheduler including HA (high availability) scheduler kind:documentation labels Aug 29, 2025
@kaxil kaxil requested review from ephraimbuddy and rawwar August 29, 2025 22:08
@rawwar
Contributor

rawwar commented Aug 30, 2025

Tested this locally and it works.

@eladkal eladkal added the backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch label Aug 30, 2025
@eladkal eladkal added this to the Airflow 3.0.7 milestone Aug 30, 2025
Member

@ashb ashb left a comment


This will work, but is there any way we could do this without relying on executor events in the future?

@eladkal
Contributor

eladkal commented Aug 30, 2025

Should on_retry_callback be triggered for cleared tasks?

In Airflow 2 when I cleared a task I did not expect on_retry_callback to be executed. I don't think I've seen anyone expecting this. Why would this be an issue specifically now in Airflow 3?

@kaxil
Member Author

kaxil commented Aug 30, 2025

This will work, but is there any way we could do this without relying on executor events in the future?

Yeah, it's possible to improve in the future. Currently, we mainly need it so the retry only runs after the current task attempt is killed; otherwise they could run in parallel, i.e. the new attempt could start running before the first one is fully terminated.

@kaxil kaxil merged commit f9adaad into apache:main Aug 30, 2025
62 checks passed
@kaxil kaxil deleted the fix-clearing-when-running branch August 30, 2025 08:39
@github-actions

Backport failed to create: v3-0-test. View the failure log for details.

You can attempt to backport this manually by running:

cherry_picker f9adaad v3-0-test

This should apply the commit to the v3-0-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

@kaxil
Member Author

kaxil commented Aug 30, 2025

Should on_retry_callback be triggered for cleared tasks?

In Airflow 2 when I cleared a task I did not expect on_retry_callback to be executed. I don't think I've seen anyone expecting this. Why would this be an issue specifically now in Airflow 3?

Mainly because in 2.x it first went to the UP_FOR_RETRY state after RESTARTING, so on_retry_callback could have been executed if set.

mangal-vairalkar pushed a commit to mangal-vairalkar/airflow that referenced this pull request Aug 30, 2025
nothingmin pushed a commit to nothingmin/airflow that referenced this pull request Sep 2, 2025
@kaxil kaxil modified the milestones: Airflow 3.0.7, Airflow 3.1.0 Sep 13, 2025
abdulrahman305 bot pushed commits to abdulrahman305/airflow that referenced this pull request (multiple times between Sep 30 and Oct 19, 2025)

Labels

  • area:Scheduler (including HA (high availability) scheduler)
  • backport-to-v3-1-test (Mark PR with this label to backport to v3-1-test branch)
  • kind:documentation


Development

Successfully merging this pull request may close these issues.

Clearing a dag in the running state leads to tasks getting stuck in restarting, and emit heartbeat timeout events continuously.

4 participants