Skip to content

Conversation

@joshowen
Copy link
Contributor

@joshowen joshowen commented Aug 9, 2023

Optimized queryplan to avoid single item groupby which isn't able to efficiently use the dag_run_dag_id_execution_date_key index:

Before:
QUERY PLAN
Nested Loop (cost=1000.55..367593.28 rows=17 width=189) (actual time=1862.277..1862.420 rows=1 loops=1)
-> Finalize GroupAggregate (cost=1000.00..366374.21 rows=142 width=28) (actual time=1862.234..1862.376 rows=1 loops=1)
Group Key: dag_run_1.dag_id
-> Gather (cost=1000.00..366371.37 rows=284 width=28) (actual time=1855.233..1862.361 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=0.00..365342.97 rows=142 width=28) (actual time=1849.562..1849.563 rows=1 loops=3)
Group Key: dag_run_1.dag_id
-> Parallel Seq Scan on dag_run dag_run_1 (cost=0.00..363162.04 rows=435902 width=28) (actual time=271.546..1736.712 rows=351110 loops=3)
Filter: (((dag_id)::text = 'REDACTED'::text) AND (((run_type)::text = 'backfill'::text) OR ((run_type)::text = 'scheduled'::text)))
Rows Removed by Filter: 782975
-> Index Scan using dag_run_dag_id_execution_date_key on dag_run (cost=0.56..8.57 rows=1 width=189) (actual time=0.036..0.036 rows=1 loops=1)
Index Cond: (((dag_id)::text = (dag_run_1.dag_id)::text) AND (execution_date = (max(dag_run_1.execution_date))))
Planning Time: 0.369 ms
Execution Time: 1862.473 ms
After:
QUERY PLAN
Nested Loop (cost=2.50..7.28 rows=5 width=189) (actual time=0.048..0.050 rows=1 loops=1)
-> Result (cost=1.95..1.96 rows=1 width=8) (actual time=0.038..0.039 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.56..1.95 rows=1 width=8) (actual time=0.037..0.037 rows=1 loops=1)
-> Index Scan Backward using dag_run_dag_id_execution_date_key on dag_run dag_run_1 (cost=0.56..1456402.98 rows=1046683 width=8) (actual time=0.036..0.036 rows=1 loops=1)
Index Cond: (((dag_id)::text = 'REDACTED'::text) AND (execution_date IS NOT NULL))
Filter: (((run_type)::text = 'backfill'::text) OR ((run_type)::text = 'scheduled'::text))
-> Index Scan using dag_run_dag_id_execution_date_key on dag_run (cost=0.56..24.37 rows=5 width=189) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (((dag_id)::text = 'REDACTED'::text) AND (execution_date = ($0)))
Planning Time: 0.403 ms
Execution Time: 0.087 ms

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@joshowen joshowen changed the title Index optimized fast path to avoid more complicated & slower groupby queryplan Index optimized fast path to avoid slower groupby queryplan Aug 9, 2023
@uranusjr
Copy link
Member

uranusjr commented Aug 9, 2023

Instead o duplicating much of the query in two branches, I’d extract the actually different part (there where argument) instead.

@uranusjr
Copy link
Member

uranusjr commented Aug 9, 2023

Also please set up pre-commit according to the contribution guide and fix linting issues.

@joshowen
Copy link
Contributor Author

joshowen commented Aug 10, 2023

@uranusjr I didn't see a clean way to do this without branching the query. This change touches both the subquery and the where clause of the parent.

@joshowen joshowen marked this pull request as ready for review August 10, 2023 17:04
@uranusjr
Copy link
Member

I think we need a test case to cover the one DAG case.

@joshowen
Copy link
Contributor Author

I think we need a test case to cover the one DAG case.

The one dag case is pretty well covered in the integration tests. I previously had a bug in this code and used those to find it (https://github.com/apache/airflow/actions/runs/5857890497/job/15898563419?pr=33242#step:5:1267). I'm happy to add a specific test case though if that is preferable though.

Copy link
Contributor

@amoghrajesh amoghrajesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, I think it would be nice to add a specific test case for this. Wdyt @uranusjr?

@uranusjr
Copy link
Member

Both branches should be covered by tests, whichever covers which.

@eladkal
Copy link
Contributor

eladkal commented Sep 13, 2023

@joshowen can you add the needed tests?

@joshowen
Copy link
Contributor Author

@uranusjr @amoghrajesh @eladkal I've added tests to cover both paths and rebased.

@joshowen
Copy link
Contributor Author

@uranusjr @amoghrajesh @eladkal Thoughts on this PR?

@eladkal eladkal added this to the Airflow 2.8.0 milestone Oct 27, 2023
@eladkal eladkal added the type:improvement Changelog: Improvements label Oct 27, 2023
@dstandish
Copy link
Contributor

This is a good optimization. We ran into this exact issue and came up with the same optimization independently. But we were thinking to just add an index since there are a couple other instances of if very similar query that I don't think can do the same thing.

But even if we do add the index I don't think it hurts to have both.

My one suggestion would be to move the query generation to a private function somewhere so that things are more readable at the call site.

@dstandish
Copy link
Contributor

hi @joshowen i moved the query builder to a function and added some testing for the query structure. please take a look. i changed the structure of the query slightly so that we don't have to join to the subquery in the 1 dag path (it can be just execution_date = (subquery))

@joshowen
Copy link
Contributor Author

hi @joshowen i moved the query builder to a function and added some testing for the query structure. please take a look. i changed the structure of the query slightly so that we don't have to join to the subquery in the 1 dag path (it can be just execution_date = (subquery))

lgtm. Thanks!

@dstandish dstandish merged commit 10c04a4 into apache:main Jan 22, 2024
@dstandish dstandish changed the title Index optimized fast path to avoid slower groupby queryplan Optimize max_execution_date query in single dag case Jan 22, 2024
@jedcunningham jedcunningham added type:bug-fix Changelog: Bug Fixes and removed type:improvement Changelog: Improvements labels Feb 9, 2024
jedcunningham pushed a commit that referenced this pull request Feb 9, 2024
We can make better use of an index when we're only dealing with one dag, which is a common case.

---------

Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
(cherry picked from commit 10c04a4)
@potiuk potiuk removed this from the Airflow 2.8.2 milestone Feb 13, 2024
@ephraimbuddy ephraimbuddy added this to the Airflow 2.9.0 milestone Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants