Fix Tasks getting stuck in scheduled state #19747
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
airflow/jobs/scheduler_job.py
Outdated
Unrelated, but I corrected a variable name that caused some confusion.
This offset approach won't work well with multiple schedulers running
Apologies if this comment won't help the discussion, but in my opinion a better approach could be something done around the "limit" clause - it seems that the assumption that it is best to limit the query of tasks waiting to be queued by the number of potentially available slots might not hold in some cases.
Yes - reducing the value of the query limit alone won't be enough. As long as we are using any limit value, there is a chance that none of the selected task instances can be queued, while there are some task instances beyond the limit that could have been scheduled. I can think of 3 possible solutions that will never cause the scheduler to get stuck.
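To make the starvation scenario above concrete, here is a small toy simulation (not Airflow code; all names, numbers and data structures are made up for illustration). It shows how limiting the candidate query to the open pool slots can leave a runnable, lower-priority task unexamined while every selected candidate is blocked by a per-DAG concurrency limit:

```python
# Hypothetical illustration of the starvation problem, NOT actual scheduler code.
from dataclasses import dataclass


@dataclass
class TI:
    dag_id: str
    task_id: str
    priority_weight: int


# "dag_a" is already running as many tasks as its max_active_tasks allows.
running_per_dag = {"dag_a": 16}
max_active_tasks = {"dag_a": 16, "dag_b": 16}

scheduled = [
    *(TI("dag_a", f"t{i}", priority_weight=10) for i in range(20)),
    TI("dag_b", "t0", priority_weight=1),  # runnable, but low priority
]

pool_open_slots = 8  # the query is limited to the number of open pool slots

# Ordered by priority and limited to the open slots: only dag_a tasks are seen.
candidates = sorted(scheduled, key=lambda ti: -ti.priority_weight)[:pool_open_slots]

queued = [
    ti for ti in candidates
    if running_per_dag.get(ti.dag_id, 0) < max_active_tasks[ti.dag_id]
]
print(queued)  # [] -- dag_b.t0 could run, but it was never in the candidate set
```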
@ashb I have now reworked the method.
Force-pushed from 2619de4 to c4b9a54
A lot more work this way, but I think this looks good. I'll need to take a look at this again with fresher eyes next week. How much/where have you tested this change?
# Conflicts:
#   airflow/jobs/scheduler_job.py
Force-pushed from 854ed2f to 95e9dff
Address comments; fix flaky test; update test_scheduler_job.py
Force-pushed from 95e9dff to 055fff7
Just rebased on the latest main branch to fix merge conflicts.
@tanelk please rebase. I guess the failing tests have been resolved in main |
jedcunningham left a comment
We definitely need another committer to take a look at this as well.
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it onto the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Awesome work, congrats on your first merged pull request!
The scheduler_job can get stuck in a state where it is not able to queue new tasks. It will get out of this state on its own, but the time taken depends on the runtime of the current tasks - this could be several hours or even days. If the scheduler can't queue any tasks because of different concurrency limits (per pool, dag or task), then on the next iterations of the scheduler loop it will try to queue the same tasks. Meanwhile there could be some scheduled tasks with lower priority_weight that could be queued, but they will remain waiting. The proposed solution is to keep track of the dag and task ids that are concurrency limited and then repeat the query with these dags and tasks filtered out.

Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
(cherry picked from commit cd68540)
Thanks @tanelk! Congrats on your first commit 🎉
closes #19622
The `scheduler_job` can get stuck in a state where it is not able to queue new tasks. It will get out of this state on its own, but the time taken depends on the runtime of the current tasks - this could be several hours or even days.

If the scheduler can't queue any tasks because of different concurrency limits (per pool, dag or task), then on the next iterations of the scheduler loop it will try to queue the same tasks. Meanwhile there could be some scheduled tasks with lower `priority_weight` that could be queued, but they will remain waiting.

The proposed solution is to keep track of the dag and task ids that are concurrency limited and then repeat the query with these dags and tasks filtered out.
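To sketch what "repeat the query with these dags and tasks filtered out" could look like, here is a rough, simplified loop. The helpers `query_scheduled_tis`, `dag_at_max_active_tasks` and `task_at_max_active_tis` are hypothetical names invented for illustration and do not match the actual code in `airflow/jobs/scheduler_job.py`:

```python
# A minimal sketch of the filter-and-retry idea described above.
# Helper functions are hypothetical placeholders, not real Airflow APIs.
def find_executable_task_instances(session, max_tis):
    starved_dags = set()    # dag_ids already at their max_active_tasks limit
    starved_tasks = set()   # (dag_id, task_id) pairs at their own concurrency limit
    executable = []

    while len(executable) < max_tis:
        # Hypothetical helper: returns 'scheduled' TIs ordered by priority_weight,
        # skipping the dags/tasks we already know cannot accept more work.
        batch = query_scheduled_tis(
            session,
            limit=max_tis - len(executable),
            exclude_dag_ids=starved_dags,
            exclude_task_keys=starved_tasks,
        )
        if not batch:
            break

        for ti in batch:
            if dag_at_max_active_tasks(ti.dag_id):                # hypothetical check
                starved_dags.add(ti.dag_id)
            elif task_at_max_active_tis(ti.dag_id, ti.task_id):   # hypothetical check
                starved_tasks.add((ti.dag_id, ti.task_id))
            else:
                executable.append(ti)
        # Every TI in the batch was either accepted or recorded as starved, so the
        # next iteration's query either reaches lower-priority TIs or returns empty.

    return executable
```

The key property is that a fully concurrency-limited batch no longer ends the attempt: the blocked dags and tasks are excluded and the query is repeated, so lower-priority but queueable task instances beyond the original limit can still be found.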
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.