Fix scheduler processing of cleared running tasks stuck in RESTARTING state
#55084
Conversation
Fix scheduler processing of cleared running tasks stuck in RESTARTING state

Clearing a DAG with running tasks causes them to get stuck in RESTARTING state indefinitely with continuous heartbeat timeout events. This was a regression in Airflow 3.x where the scheduler failed to process executor completion events for RESTARTING tasks.

The scheduler's _process_executor_events() function has two gatekeeping mechanisms that filter executor events, both of which were missing TaskInstanceState.RESTARTING. This caused the scheduler to ignore when tasks in RESTARTING state were successfully terminated by the executor.

Changes:
- Add RESTARTING to executor event processing filters in scheduler
- Add special handling for successful termination of RESTARTING tasks
- Set RESTARTING tasks to None (scheduled) state with adjusted max_tries
- Update task lifecycle diagram to reflect RESTARTING → None transition
- Add regression test for max_tries edge case with proper mock spec

This ensures cleared running tasks can retry immediately without getting stuck, consistent with clearing behavior for non-running tasks.

Fixes apache#55045
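To make the change list above concrete, here is a rough, hypothetical sketch of the new handling. The function name `handle_executor_event`, the `ti` object, and the exact `max_tries` adjustment are illustrative assumptions, not the actual scheduler code:

```python
from airflow.utils.state import TaskInstanceState


def handle_executor_event(ti, executor_state):
    """Illustrative only: how the scheduler could treat a completion event
    for a task instance that was cleared while running (RESTARTING)."""
    if ti.state == TaskInstanceState.RESTARTING and executor_state == TaskInstanceState.SUCCESS:
        # The executor has confirmed the old attempt is terminated, so the
        # task instance can go back to "no state" (i.e. scheduled) and retry
        # immediately, matching the clearing behavior for non-running tasks.
        ti.state = None
        # Bump max_tries so the cleared attempt does not eat into the retry
        # budget (the exact adjustment in the real fix may differ).
        ti.max_tries = ti.try_number + ti.task.retries
```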
Tested this locally and it works.
ashb left a comment
This will work, but is there any way we could do this without relying on executor events in the future?
In Airflow 2, when I cleared a task, I did not expect `on_retry_callback` to be executed. I don't think I've seen anyone expecting this. Why would this be an issue specifically now in Airflow 3?
Yeah, possible to improve in the future. Currently, we mainly need it so the new attempt only runs after the current task attempt is killed; otherwise they could run in parallel, i.e. the new attempt could start running before the first one is fully terminated.
Backport failed to create: v3-0-test.

You can attempt to backport this manually by running:

`cherry_picker f9adaad v3-0-test`

This should apply the commit to the v3-0-test branch and leave the commit in conflict state. After you have resolved the conflicts, you can continue the backport process by running:

`cherry_picker --continue`
Mainly because in 2.x it first went to the `UP_FOR_RETRY` state after `RESTARTING`, so `on_retry_callback` could have been executed if set.
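For context on what is being debated, here is a minimal sketch (hypothetical DAG and task names) of where `on_retry_callback` is configured; the question raised above is whether clearing a running task should invoke it:

```python
import pendulum
from airflow.decorators import dag, task


def notify_retry(context):
    # Invoked by Airflow whenever this task instance is put up for retry.
    ti = context["ti"]
    print(f"Retrying {ti.task_id}, attempt {ti.try_number}")


@dag(schedule=None, start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def retry_callback_example():
    @task(retries=2, on_retry_callback=notify_retry)
    def flaky():
        raise RuntimeError("transient failure")

    flaky()


retry_callback_example()
```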
closes #55045
Clearing a DAG with running tasks causes them to get stuck in the `RESTARTING` state indefinitely, with continuous heartbeat timeout events. This is a regression in Airflow 3.x where the scheduler fails to process executor completion events for `RESTARTING` tasks. This change also preserves logs.

Error Symptoms

- Tasks stuck in the `RESTARTING` state after clearing
- Heartbeat requests failing with `404 Not Found` or `409 Conflict`
- Executor reports `SUCCESS`, but the task state doesn't update
Root Cause

The scheduler's `_process_executor_events()` function has two gatekeeping mechanisms that filter which executor events get processed. Both filters were missing `TaskInstanceState.RESTARTING`, causing the scheduler to ignore when tasks in the `RESTARTING` state were successfully terminated by the executor.
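As an illustration of the kind of gatekeeping described above (simplified, with hypothetical names — not the actual `_process_executor_events()` code), the fix amounts to including `RESTARTING` in the set of states whose executor events the scheduler acts on:

```python
from airflow.utils.state import TaskInstanceState

# States for which the scheduler should act on executor completion events.
# Before the fix, RESTARTING was missing, so completion events for cleared
# running tasks were silently ignored and the tasks stayed stuck.
STATES_TO_PROCESS = frozenset(
    {
        TaskInstanceState.QUEUED,
        TaskInstanceState.RUNNING,
        TaskInstanceState.RESTARTING,  # added by this fix
    }
)


def should_process_event(ti_state):
    return ti_state in STATES_TO_PROCESS
```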
Historical Context

The `RESTARTING` state was introduced in Airflow 2.2.0 (PR #16681) to handle cleared running tasks gracefully, so they are retried rather than marked `FAILED` (like `SHUTDOWN` did). In Airflow 3.x, the zombie detection mechanism was removed with the transition from the Job-based to the Task SDK architecture, but the scheduler was never updated to handle `RESTARTING` tasks in executor events.
Why not immediately set state to None

The `RESTARTING` state serves a critical purpose in the clearing workflow (a rough sketch follows this section):

1. Race condition prevention: When `clear_task_instances()` is called, it immediately creates a new `ti.id` via `prepare_db_for_next_try()` and sets the state to `RESTARTING`.
2. Running task process continues: The task process keeps running with the old `ti.id` and attempts heartbeating, which fails with `404 Not Found` since the database now has a new `ti.id`.
3. Executor termination: The executor eventually kills the running process and reports `SUCCESS` for the terminated task.
4. Scheduler coordination: The `RESTARTING` state signals to the scheduler that this task was intentionally cleared and should be retried (not failed) when the executor confirms termination.

Setting the state directly to `None` during clearing would lose this coordination mechanism and could lead to race conditions where the scheduler processes a "completed" task before the executor has terminated the running process.
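A minimal sketch of the clearing decision described in this section (hypothetical helper, not the real `clear_task_instances()` implementation):

```python
from airflow.utils.state import TaskInstanceState


def state_after_clear(current_state):
    """Illustrative only: what state a cleared task instance is parked in."""
    if current_state == TaskInstanceState.RUNNING:
        # Keep coordinating with the executor: the scheduler reschedules the
        # task only after the executor confirms the old attempt is terminated.
        return TaskInstanceState.RESTARTING
    # Non-running task instances can go straight back to "no state" and be
    # scheduled again immediately.
    return None
```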
Future Considerations

This fix resolves the immediate regression but raises design questions for future improvements:

- Should `on_retry_callback` be triggered for cleared tasks?
- Should `email_on_retry` be sent?

These questions are deferred to maintain scope and can be addressed in follow-up discussions if needed.
Logs are preserved with this change:
