Skip to content

Conversation

@AutomationDev85
Copy link
Contributor

Hi airflow community,
this is my third PR and be happy to work on the scheduler runtime again. We faced an issue with slow scheduler execution time by having millions of queued dag_runs for one DAG.

I found out that the query next_dagruns_to_examine needs many seconds to be executed with huge amount of dag_runs. I was able to drop the execution time to some ms by using this Index. This was really a big jump in the performance of the scheduler. Can someone support me how to get this into the migration scripts? What are you thinking about this Index?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting idea to add partial index here, Likely the improvment in your case is because the whole index takes too much memory of your server and gest swapped out. I do not know that much about thos partial indexes but it does seem like a legitimate optimisiation that might improve one of the most frequent queries scheduler runs,

I do not see an immediate side-effect of that one (and regarding the migration - I think you should follow the regular CONTRIBUTORS.rst alembic guide: https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#metadata-database-updates

but I think I'd love others to comment if they also see any problem with such an optimisation?

Copy link
Contributor Author

@AutomationDev85 AutomationDev85 Apr 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback! I moved the Index away from the dagrun.py because then the auto migration failes on the sqllite because not supports NULLS FIRST. Added the Index now only into the migration if not sqllite. Is this the way to go or did i missed something here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a bried look ad SQLAlchemy docs, would https://docs.sqlalchemy.org/en/20/core/constraints.html#sqlalchemy.schema.Constraint.dialect_options help in this area? (=dialect_options) to have this specific condition in postgres only?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do it this way but then - for consistency - we should change it everywhere else (because we are doing the same approach in multiple other places). I prefer this kind of explicit if actually.

Comment on lines 44 to 49
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This index needs to be also declared somewhere in the ORM, I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem for me was after I added the Index to the table_args of dagrun alembic tried to add this index automatically during pre-commit and this was failing because sqlite does not handle the NULLS FIRST. So I only added this in the migration script. Is there a better way to handle this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have pre-commit to generate alembic migrations automatically.

Copy link
Member

@potiuk potiuk Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a script to check their names and rename them appriately if they are added - but not to generate them. You have to generate it with alembic (and manually correct possibly) and add manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I added the index again in the table_args . If i added this in the table_args and executing the pre-commit after "alembic revision -m "add new field to db"" step it was failing do to the null errors and thought alembic is automatically trying to convert DB.... Only with removing the index out of table_args and then working on the alembic script it was passing the pre-commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will also have to change 2.6.0 to 2.7.0 references in the migratio name/versions and rebase your PR after we merge #30958 - this one will update the versions (2.6.0 branch is already created and we are releasing the 2.6.0 from that branch now)

@potiuk
Copy link
Member

potiuk commented Apr 30, 2023

Now, that 2.7.0dev change has been merged, you will need to rebase and update 2.6 to 2.7 everywhere.

@AutomationDev85 AutomationDev85 force-pushed the feature/optimize-scheduler-run-time-4 branch from 36ed372 to 0ba2cd1 Compare May 2, 2023 06:25
Copy link
Contributor Author

@AutomationDev85 AutomationDev85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint! Updated the code.

@potiuk potiuk mentioned this pull request May 2, 2023
Copy link
Contributor Author

@AutomationDev85 AutomationDev85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok in the CI we are getting now the same error I mentioned before regarding the sqlite and NULLS FIRST:
sqlite3.OperationalError: unsupported use of NULLS FIRST

Has anybody an idea how to handle this?

@potiuk
Copy link
Member

potiuk commented May 8, 2023

@AutomationDev85 - I've merged another migration, can you plese rebase and resolve the conflict?

@jscheffl jscheffl force-pushed the feature/optimize-scheduler-run-time-4 branch from 4cd74ca to 8dcd734 Compare May 14, 2023 21:28
@jscheffl
Copy link
Contributor

@AutomationDev85 - I've merged another migration, can you plese rebase and resolve the conflict?

While @AutomationDev85 is on leave I tried to re-base on the latest main. Stumbled a bit around the migration tooling and needed multiple approaches. Hope I made it right now.

@potiuk potiuk merged commit c63b777 into apache:main May 19, 2023
potiuk added a commit to potiuk/airflow that referenced this pull request May 19, 2023
potiuk added a commit that referenced this pull request May 19, 2023
* Revert "Revert "Save scheduler execution time by caching dags (#30704)" (#31413)"

This reverts commit e6f2117.

* Revert "Save scheduler execution time by adding new Index idea for dag_run (#30827)"

This reverts commit c63b777.
@potiuk
Copy link
Member

potiuk commented May 19, 2023

Hello @AutomationDev85 -> unfortunately we had to revert that one, as it caused problems for MSSQL index. Can you please redo this one in a new PR ? We already fixed the condition that skipped MSSQL in cases like yours, but your PR has been based on earlier version of CI (it was 20 commits behind). Next time when you open it, it should also run for MSSQL and it should fail with an error there that you will have a chance to fix before we merge it again.

That should not happen in the future, we caught that case in one of the recent PRs and fixed it, you just need to base your change on the latest main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type:improvement Changelog: Improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants