
Conversation

@tanelk (Contributor) commented Nov 22, 2021

closes #19622

The scheduler_job can get stuck in a state where it is not able to queue new tasks. It will get out of this state on its own, but the time taken depends on the runtime of the currently running tasks - this could be several hours or even days.

If the scheduler can't queue any tasks because of the various concurrency limits (per pool, dag or task), then on the next iterations of the scheduler loop it will try to queue the same tasks again. Meanwhile there could be scheduled tasks with lower priority_weight that could be queued, but they will remain waiting.

The proposed solution is to keep track of the dag and task ids that are concurrency limited, and then repeat the query with these dags and tasks filtered out.
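To make that concrete, here is a minimal, runnable sketch of the iterative loop. This is not the actual scheduler_job.py code: the TI dataclass, find_queueable, running_per_dag and dag_limits are toy stand-ins, only the dag-level limit is modelled, and the real change also tracks task-level limits and does the candidate selection with SQL queries rather than in memory.

```python
# Sketch of the "track starved dags and repeat the query" idea behind this PR.
# Toy stand-ins only - not the real implementation in airflow/jobs/scheduler_job.py.
from dataclasses import dataclass


@dataclass(frozen=True)
class TI:
    """Toy stand-in for a scheduled TaskInstance row."""
    dag_id: str
    task_id: str
    priority_weight: int


def find_queueable(scheduled, running_per_dag, dag_limits, max_tis, max_iterations=5):
    """Pick up to max_tis TIs, re-selecting with starved dags filtered out."""
    starved_dags = set()   # dag_ids already known to be at their concurrency limit
    selected = set()
    queueable = []

    for _ in range(max_iterations):
        # Stand-in for re-running the SQL query: highest priority first,
        # starved dags excluded, limited to the slots still unaccounted for.
        candidates = sorted(
            (ti for ti in scheduled
             if ti.dag_id not in starved_dags and ti not in selected),
            key=lambda ti: -ti.priority_weight,
        )[: max_tis - len(queueable)]
        if not candidates:
            break  # nothing left that could possibly be queued

        for ti in candidates:
            if running_per_dag.get(ti.dag_id, 0) >= dag_limits.get(ti.dag_id, float("inf")):
                starved_dags.add(ti.dag_id)  # dag is full - skip it on the next pass
            else:
                queueable.append(ti)
                selected.add(ti)
                running_per_dag[ti.dag_id] = running_per_dag.get(ti.dag_id, 0) + 1

        if len(queueable) >= max_tis:
            break

    return queueable


if __name__ == "__main__":
    # Scenario from the issue: a dag at its limit fills the whole selection window.
    scheduled = [
        TI("big_dag", "a", priority_weight=10),
        TI("big_dag", "b", priority_weight=10),
        TI("small_dag", "c", priority_weight=1),
    ]
    running = {"big_dag": 4}                     # big_dag already has 4 running TIs
    limits = {"big_dag": 4, "small_dag": 4}
    # A single pass with max_tis=2 would select only the two big_dag TIs and queue
    # nothing; the iterative version marks big_dag as starved and queues "c".
    print(find_queueable(scheduled, running, limits, max_tis=2))
```

In this toy scenario the first pass selects the two high-priority TIs of the starved dag and queues nothing, which is exactly the stuck state described above; the second pass filters that dag out and finds the lower-priority task.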



Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

boring-cyborg bot added the area:Scheduler label Nov 22, 2021
boring-cyborg bot commented Nov 22, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide, and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@tanelk (Contributor, Author) commented:

Unrelated, but corrected a variable name that caused some confusion.

@ashb (Member) previously requested changes Nov 22, 2021

This offset approach won't work well with multiple schedulers running.

@tanelk requested a review from ashb November 23, 2021 06:49
@tanelk changed the title from "Fix Tasks get stuck in scheduled state" to "Fix Tasks getting stuck in scheduled state" Nov 23, 2021
@vapiravfif (Contributor) commented:

Apologies if this comment won't help the discussion, but in my opinion a better approach could be something done around the "limit" clause -
query = query.limit(max_tis) (origin)
and
max_tis = min(self.max_tis_per_query, self.executor.slots_available) (origin)

Because it seems that the assumption that it is better to limit the query of tasks waiting to be queued by the potentially available slots might not be the best practice in some cases.
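For context, here is a simplified sketch of the kind of query being discussed. It is not the exact Airflow query, which applies more joins and filters; it only shows where the quoted limit clause sits. The concurrency checks the thread is about happen afterwards, in Python, on the rows returned here, so a task instance past the limit is never even considered in that scheduler loop.

```python
# Simplified sketch, not the exact query from scheduler_job.py.
from sqlalchemy import desc
from sqlalchemy.orm import Session

from airflow.models import TaskInstance
from airflow.utils.state import State


def fetch_scheduled_tis(session: Session, max_tis: int):
    # Only the first max_tis rows (by priority) are considered in this scheduler
    # loop; everything past the limit waits for a later iteration.
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.SCHEDULED)
        .order_by(desc(TaskInstance.priority_weight))
        .limit(max_tis)
        .all()
    )
```

The quoted max_tis = min(self.max_tis_per_query, self.executor.slots_available) only caps how many rows this query returns; it does not change which rows are picked first.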

@tanelk (Contributor, Author) commented Nov 23, 2021

> Apologies if this comment won't help the discussion, but in my opinion a better approach could be something done around the "limit" clause - query = query.limit(max_tis) (origin) and max_tis = min(self.max_tis_per_query, self.executor.slots_available) (origin)
>
> Because it seems that the assumption that it is better to limit the query of tasks waiting to be queued by the potentially available slots might not be the best practice in some cases.

Yes, reducing the value of max_tis will make this situation worse, but removing the max_tis = min(...) does not guarantee that the issue will be solved.

As long as we are using any limit value, there is a chance that none of the selected task instances can be queued, while there are task instances beyond the limit that could have been queued. For example, if the first max_tis scheduled task instances all belong to a dag that is already at its concurrency limit, none of them can be queued, even though a runnable lower-priority task from another dag sits just past the limit.

I can think of 3 possible solutions that will never cause the scheduler to get stuck:

  1. Remove the limit - could risk memory issues on a large Airflow installation (many dags and tasks).
  2. Build all the concurrency filters into the SQL query - the task-level limit seems impossible or very difficult to express in the current data model.
  3. Some sort of iterative approach that looks at the task instances further down the list - this is what this PR proposes.

@tanelk (Contributor, Author) commented Nov 23, 2021

> This offset approach won't work well with multiple schedulers running

@ashb I have now reworked the method.

@tanelk force-pushed the 19622_tasks_stuck_in_scheduled branch from 2619de4 to c4b9a54 on November 26, 2021 08:48
@ashb (Member) commented Nov 26, 2021

A lot more work this way, but I think this looks good. I'll need to take a look at this again with fresher eyes next week.

How much/where have you tested this change?

Tanel Kiis added 2 commits December 13, 2021 11:36
# Conflicts:
#	airflow/jobs/scheduler_job.py
@kaxil force-pushed the 19622_tasks_stuck_in_scheduled branch 2 times, most recently from 854ed2f to 95e9dff, on December 17, 2021 20:15
Address comments

Fix flaky test

Update test_scheduler_job.py
@kaxil force-pushed the 19622_tasks_stuck_in_scheduled branch from 95e9dff to 055fff7 on December 17, 2021 20:38
@kaxil (Member) commented Dec 17, 2021

Just rebased on latest main branch to fix merge conflicts

@ephraimbuddy (Contributor) commented:

@tanelk please rebase. I guess the failing tests have been resolved in main.

@jedcunningham (Member) left a comment:

We definitely need another committer to take a look at this as well.

github-actions bot added the "full tests needed" label Mar 22, 2022
github-actions bot commented:

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@ephraimbuddy added the "type:bug-fix" label Mar 22, 2022
@ephraimbuddy reopened this Mar 22, 2022
@ephraimbuddy merged commit cd68540 into apache:main Mar 22, 2022
boring-cyborg bot commented Mar 22, 2022

Awesome work, congrats on your first merged pull request!

ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
(cherry picked from commit cd68540)
ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
(cherry picked from commit cd68540)
@jedcunningham (Member) commented:

Thanks @tanelk! Congrats on your first commit 🎉

@tanelk deleted the 19622_tasks_stuck_in_scheduled branch March 23, 2022 16:10
ephraimbuddy pushed a commit that referenced this pull request Mar 23, 2022
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
(cherry picked from commit cd68540)
ephraimbuddy pushed a commit that referenced this pull request Mar 24, 2022
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
(cherry picked from commit cd68540)
ephraimbuddy pushed a commit that referenced this pull request Mar 26, 2022
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
(cherry picked from commit cd68540)

Labels

  • area:Scheduler - including HA (high availability) scheduler
  • full tests needed - We need to run full set of tests for this PR to merge
  • type:bug-fix - Changelog: Bug Fixes


Development

Successfully merging this pull request may close these issues:

  • Tasks get stuck in "scheduled" state and starved when dags with huge amount of tasks is scheduled

7 participants