@jscheffl
Contributor

Backport of #43520.
Note: The cherry-pick omits the K8s provider files, as these are always taken from main during test and release.

The old "stuck in queued" logic just failed the tasks. Now we requeue them. We accomplish this by revoking the task from the executor and setting its state to scheduled. We'll requeue it up to 2 times; the number of times is configurable via a hidden config option.
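
The requeue-then-fail decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `TaskInstance`, `handle_stuck_task`, `requeue_attempts`, and `MAX_REQUEUE_ATTEMPTS` are hypothetical names, and the limit of 2 simply mirrors the default mentioned in the description.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the hidden config value; the description mentions 2.
MAX_REQUEUE_ATTEMPTS = 2


@dataclass
class TaskInstance:
    """Toy task instance for illustration; not Airflow's model."""
    task_id: str
    state: str = "queued"
    requeue_attempts: int = 0


def handle_stuck_task(ti: TaskInstance, max_attempts: int = MAX_REQUEUE_ATTEMPTS) -> str:
    """Requeue a task stuck in queued, or fail it once the attempts are used up."""
    if ti.requeue_attempts < max_attempts:
        ti.requeue_attempts += 1
        ti.state = "scheduled"  # the scheduler will pick it up and queue it again
    else:
        ti.state = "failed"  # attempts exhausted: fall back to the old behavior
    return ti.state


ti = TaskInstance("example_task")
states = []
for _ in range(3):
    ti.state = "queued"  # pretend the task got stuck in queued again
    states.append(handle_stuck_task(ti))
print(states)  # ['scheduled', 'scheduled', 'failed']
```

With a limit of 2, the task is requeued on the first two detections and failed on the third.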

We added a revoke_task method to the base executor because it's a discrete operation that is required for this feature, and it might be useful in other cases, e.g. when detecting zombies. We set the state to failed or scheduled directly from the scheduler (rather than sending it through the event buffer) because the event buffer makes more sense for handling external events -- why round trip through the executor and back to the scheduler when the scheduler is initiating the action? This also avoids having to deal with "state mismatch" issues when processing events.
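
To make the design choice concrete, here is a rough sketch of a discrete revoke operation combined with a state change made directly by the scheduler. Everything here is assumed for illustration: `SketchExecutor`, its `queued` and `event_buffer` attributes, and `requeue_from_scheduler` are hypothetical, and the `revoke_task` signature shown is not taken from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Toy task instance for illustration; not Airflow's model."""
    key: str
    state: str = "queued"


@dataclass
class SketchExecutor:
    """Toy executor; it only mirrors the idea of a discrete revoke operation."""
    queued: dict = field(default_factory=dict)
    event_buffer: dict = field(default_factory=dict)

    def queue(self, ti: TaskInstance) -> None:
        self.queued[ti.key] = ti

    def revoke_task(self, ti: TaskInstance) -> None:
        # Forget the task. Deliberately nothing is written to event_buffer,
        # so the scheduler has no event to reconcile afterwards.
        self.queued.pop(ti.key, None)


def requeue_from_scheduler(executor: SketchExecutor, ti: TaskInstance) -> None:
    """Scheduler-initiated requeue: revoke, then set the state directly."""
    executor.revoke_task(ti)
    ti.state = "scheduled"  # no round trip through the executor


executor = SketchExecutor()
ti = TaskInstance("example_dag.example_task.run_1")
executor.queue(ti)
requeue_from_scheduler(executor, ti)
print(ti.state, executor.queued, executor.event_buffer)  # scheduled {} {}
```

Because the scheduler writes the new state itself, there is no executor event whose reported state could disagree with the task's current state.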


(cherry picked from commit a41feeb)

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
@jscheffl jscheffl added this to the Airflow 2.10.4 milestone Nov 18, 2024
@boring-cyborg boring-cyborg bot added area:Executors-core LocalExecutor & SequentialExecutor area:Scheduler including HA (high availability) scheduler kind:documentation labels Nov 18, 2024
@jscheffl jscheffl added the type:bug-fix Changelog: Bug Fixes label Nov 18, 2024
@dstandish
Contributor

might need this as well @jscheffl #44093

@jscheffl
Contributor Author

> might need this as well @jscheffl #44093

Yeeah, figured out the same commit right at the same time :-D Added to the PR!

@jscheffl jscheffl merged commit 341d36d into apache:v2-10-test Nov 19, 2024
utkarsharma2 pushed a commit that referenced this pull request Dec 4, 2024
…44158)

* [v2-10-test] Re-queue tasks when they are stuck in queued (#43520)

* fix test_handle_stuck_queued_tasks_multiple_attempts (#44093)

---------

Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: GPK <gopidesupavan@gmail.com>
utkarsharma2 pushed a commit that referenced this pull request Dec 9, 2024
@jscheffl jscheffl deleted the backport-a41feeb-v2-10-test branch October 5, 2025 07:42