Save scheduler execution time during search for queued dag_runs #30699

AutomationDev85 · 2023-04-18T09:20:24Z

Hi airflow community,
this is my first PR, and I am happy to work on the scheduler runtime. We faced slow scheduler execution times when millions of dag_runs were queued for one DAG. This is the first PR; more are in the queue.

This PR adds .all() to the query so that the function matches its pydantic definition and returns only a list of dag_runs. This optimizes the scheduler runtime, because without this change the query is executed twice in the function _start_queued_dagruns in airflow/jobs/scheduler_job_runner.py. So this saves execution time in the scheduler.

@vandonr-amz fyi, as discussed with @jens-scheffler-bosch

boring-cyborg · 2023-04-18T09:20:27Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
In case of a new feature, add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

vandonr-amz · 2023-04-18T16:22:13Z

Nice small change with a big impact :)
I'm not familiar with SQLAlchemy, so I'll rephrase my comprehension of your change here to make sure I understand well what you're doing (and tell me if I'm wrong):
The method next_dagruns_to_examine claimed to return a List here
https://github.com/aws-mwaa/upstream-to-airflow/blob/1ebeb19bf7542850fff2f1e2f9795ad70c1b24e2/airflow/models/dagrun.py#L294
but this type annotation was wrong, as it was returning a query, which was lazily run every time it was iterated.
So calling the method once but iterating twice on the results resulted in the query being executed twice.

Adding this .all() forces evaluation on the spot and actually returns a List.

Did I get this right ?

Did you measure the improvement brought by this PR ? If so, how ? Do you have any result to share ?

jscheffl · 2023-04-18T18:18:00Z

> Nice small change with a big impact :) I'm not familiar with SQLAlchemy, so I'll rephrase my comprehension of your change here to make sure I understand well what you're doing (and tell me if I'm wrong): The method next_dagruns_to_examine claimed to return a List here https://github.com/aws-mwaa/upstream-to-airflow/blob/1ebeb19bf7542850fff2f1e2f9795ad70c1b24e2/airflow/models/dagrun.py#L294 but this type annotation was wrong, as it was returning a query, which was lazily run every time it was iterated. So calling the method once but iterating twice on the results resulted in the query being executed twice.

You are right. The annotation is correct, though: a list is returned, but it is lazily evaluated by SQLAlchemy.

> Adding this .all() forces evaluation on the spot and actually returns a List.

Correct.
The optimization concerns the call at https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L1326: lines 1329 and 1351 each iterate over the returned list. Because it is lazily evaluated, the query is executed twice, although the code only wants to loop twice over the (static) list.

> Did you measure the improvement brought by this PR ? If so, how ? Do you have any result to share ?

The improvement depends on your DAG queue length and DB query performance. Together with/before the other PR, we had this query running for 5-15 seconds, times two. Apart from the query being sub-optimal in some cases (another PR will address this), we immediately saved 50% of the time in this section of our scheduler loop, i.e. 5-15 seconds per scheduler loop.
But even if the queue is not too long and the query takes 50-100 ms, you still save 50% of the DB effort :-D
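
The double execution described above can be reproduced with a small sketch. This is not Airflow or SQLAlchemy code: `LazyQuery` is a hypothetical stand-in, built only on the standard library's `sqlite3`, for an object that runs its SQL every time it is iterated. The table name and columns are likewise assumptions made for illustration.

```python
import sqlite3


class LazyQuery:
    """Hypothetical stand-in for a lazily evaluated ORM query.

    It runs its SQL each time it is iterated, which is the behaviour
    discussed in this thread.
    """

    def __init__(self, conn, sql):
        self.conn = conn
        self.sql = sql
        self.executions = 0  # how many times the SQL was sent to the database

    def __iter__(self):
        self.executions += 1
        return iter(self.conn.execute(self.sql).fetchall())

    def all(self):
        """Run the query once and return a plain list."""
        return list(self)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO dag_run (state) VALUES (?)",
    [("queued",), ("queued",), ("running",)],
)

sql = "SELECT id FROM dag_run WHERE state = 'queued' ORDER BY id"

# Before: the caller receives the lazy object and loops over it twice.
lazy = LazyQuery(conn, sql)
first_pass = [row[0] for row in lazy]
second_pass = [row[0] for row in lazy]
lazy_executions = lazy.executions

# After: the result is materialized once, and both loops reuse the list.
query = LazyQuery(conn, sql)
dag_runs = query.all()
first_pass_list = [row[0] for row in dag_runs]
second_pass_list = [row[0] for row in dag_runs]
materialized_executions = query.executions

print(lazy_executions, materialized_executions)  # prints "2 1"
```

In this toy setup the two loops see the same rows either way; the only difference is how often the database is asked, which is where the saving reported above would come from.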

vandonr-amz

Nice, thank you for the detailed explanation :)
(non binding) LGTM 👍

AutomationDev85

Does anyone have an idea why this failed in the CI:
ERROR tests/utils/test_db_cleanup.py::TestDBCleanup::test__cleanup_table[middle]
ERROR tests/utils/test_db_cleanup.py::TestDBCleanup::test__cleanup_table[end_exactly]
Is this flaky in the CI? I do not think this has anything to do with the changes in this PR. How can I trigger the CI run again?

potiuk · 2023-04-20T09:36:48Z

I re-ran it. Yes, we have a few flaky tests. We try to keep them down as much as possible, but eventually it is a matter of probability that one will happen: when they occur in 1 of 500 runs or so, the chances they will get solved are low because reproducibility is low. But usually, when a test fails in one job only and is fine in the others, it is a flaky one.

Luckily we can re-run just the failed job when they fail, which is what I did. (BTW, we might simply apply a flake plugin for these kinds of tests in the near future.) This is the next improvement I have on my list.

airflow/models/dagrun.py

uranusjr · 2023-05-15T02:28:16Z

(Todo for self: Code around where this function is called can use quite some typing improvements and optimisations using lazy iterators after this one is merged.)

uranusjr · 2023-05-23T10:01:05Z

Need to fix tests

airflow/jobs/scheduler_job_runner.py

pierrejeambrun · 2023-05-24T21:06:39Z

Relaunching failed static check, weird unrelated error on open-api-linter.

edit: Ok I see we have this problem on multiple PRs right now, will most probably fail again until we find a fix. (I believe uranusjr is working on that)

edit: #31518 should have solved that, can you rebase and try again ?

airflow/jobs/scheduler_job_runner.py

* Function returns list of dagruns and not query
* Changed pytests
* Changed all to _start_queued_dagruns
* Added comment and fixed tests
* Fixed typo

(cherry picked from commit 0fd42ff)

AutomationDev85 requested review from XD-DENG, ashb and kaxil as code owners April 18, 2023 09:20

vandonr-amz approved these changes Apr 18, 2023

pierrejeambrun approved these changes Apr 19, 2023

AutomationDev85 commented Apr 20, 2023

eladkal added this to the Airflow 2.6.1 milestone Apr 28, 2023

ephraimbuddy modified the milestones: Airflow 2.6.1, Airflow 2.6.2 May 8, 2023

uranusjr reviewed May 15, 2023

airflow/models/dagrun.py (outdated)

uranusjr mentioned this pull request May 15, 2023

Improve typing in SchedulerJobRunner #31285

Merged

uranusjr reviewed May 23, 2023

airflow/jobs/scheduler_job_runner.py (outdated)

uranusjr approved these changes May 24, 2023

pierrejeambrun reviewed May 24, 2023

airflow/jobs/scheduler_job_runner.py (outdated)

Marco Küttelwesch added 5 commits May 25, 2023 12:14

Function returns list of dagruns and not query

7031a20

Changed pytests

ea7eb4c

Changed all to _start_queued_dagruns

1489c4e

Added comment and fixed tests

93e8bd9

Fixed typo

e5694f0

AutomationDev85 force-pushed the feature/optimise-scheduler-run-time-1 branch from c7a4702 to e5694f0 Compare May 25, 2023 12:16

pierrejeambrun merged commit 0fd42ff into apache:main May 25, 2023

eladkal added the type:bug-fix Changelog: Bug Fixes label Jun 8, 2023

eladkal mentioned this pull request Jun 13, 2023

Status of testing of Apache Airflow 2.6.2rc2 #31867

Closed

62 tasks
