Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de… #5908
Conversation
A lot of conflicts that I need to clean up. Please review after I remove the WIP tag and CI passes.
The AIRFLOW-4797 Jira is still in the Done state?
@tooptoop4 I don't have permission to edit the Jira, but if its state needs to change, I think that should happen after this PR is merged?
Force-pushed from bad59a8 to 5d42c76
@ashb @mik-laj @milton0825 @kaxil It's ready, PTAL. CC: @seelmann
Do we have any numbers comparing the running time of the old aggregated query and the new find-zombie query? Also, the join query can substantially slow down the parser process.
@KevinYang21 The original PR says:
Do you get different performance behaviour?
Thank you guys for reviewing! @milton0825 We benchmarked the two approaches during the initial PR 3873 with 4k DAG files and 30k. With the aggregated query the DB CPU usage stayed under 50%, while with the subprocess query the DB was killed instantly. In our production cluster at that time, running ~20k tasks concurrently with 2k DAG files, DB CPU went from 80%+ to ~40%. In our current production DB, with >23M rows in the task_instance table and >4M rows in the job table, the query takes 0.5 second on average (we have a powerful DB, but the PR being reverted also showed an average runtime of 0.5 second for that query), so it shouldn't slow down the DAG processor manager too much.

@ashb pg_stat won't get flushed until the DB is restarted, so we don't really see the difference in frequency, but that is pretty important in the evaluation here. Even with the provided data, the query time added by 25 DAG files would already exceed that of the joined query, not to mention the overhead of starting/stopping a transaction per file. In general I believe it is better to use the aggregated query and leverage the query optimizer, instead of trying to optimize ourselves. And especially in a large-scale cluster with a huge number of DAG files to parse, it would be a show stopper if we distributed the query to the subprocesses.
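For readers without the context: below is a rough sketch of the kind of manager-level aggregated zombie query being discussed, i.e. one join between task_instance and job run once by the manager instead of once per parsed DAG file. The model and import names assume Airflow 1.10-era internals; this is illustrative, not the exact code touched by this PR.

```python
# Illustrative sketch of an aggregated zombie query run by the manager.
# Assumes Airflow 1.10-era model/import paths (TaskInstance, LocalTaskJob).
from datetime import timedelta

from sqlalchemy import or_

from airflow import settings
from airflow.jobs import LocalTaskJob as LJ  # assumed 1.10 import path
from airflow.models import TaskInstance as TI
from airflow.utils import timezone
from airflow.utils.state import State


def find_zombies(zombie_threshold_secs=300):
    """One aggregated query in the manager instead of one join per DAG file."""
    limit_dttm = timezone.utcnow() - timedelta(seconds=zombie_threshold_secs)
    session = settings.Session()
    try:
        # A RUNNING task instance whose backing job has stopped heartbeating
        # (or is no longer RUNNING itself) is considered a zombie.
        return (
            session.query(TI)
            .join(LJ, TI.job_id == LJ.id)
            .filter(TI.state == State.RUNNING)
            .filter(or_(LJ.state != State.RUNNING,
                        LJ.latest_heartbeat < limit_dttm))
            .all()
        )
    finally:
        session.close()
```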
Just reverting it would reintroduce the zombie detection problem (see the Jira and the discussion in the initial PR #5420). I'm OK with reverting it if it increases DB load when you have thousands of DAGs (we only have 20+), but then we need to find another way to fix zombie detection. Why is zombie detection part of DAG file processing at all? Couldn't there be a separate background thread that checks for zombie tasks, e.g. once per minute (configurable)?
@KevinYang21 What is the underlying SQL statement being run in the old vs. the new approach?
@seelmann For sure, my bad, I missed the issue described in the Jira. Basically the zombie detection logic is placed in the wrong scope inside the DAG processor manager loop. A straightforward fix is to extract the logic from the heartbeat method to the outer scope and then send the same version of the zombies until the next zombie detection. If subprocess zombie detection is really preferred, we can later tie the zombie detection logic to the DAG dir refresh logic to make sure we send the same version of zombies to the DAG file processors in the same round. I'll just go ahead and apply the straightforward fix, but I'm open to any other approach addressing the original zombie detection problem; we can optimize the logic later if desired. Does that make sense?
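A minimal sketch of that "straightforward fix": detect zombies on the manager's outer loop at a configurable interval and hand the same zombie list to every DAG file processor until the next detection round. The class and attribute names here are hypothetical, not Airflow's actual API.

```python
# Hypothetical structure only: refresh zombies once per interval in the
# manager and reuse the same list for every processor in that round.
import time


class ManagerLoopSketch:
    def __init__(self, find_zombies, zombie_detection_interval=10.0):
        self._find_zombies = find_zombies           # e.g. the aggregated query above
        self._interval = zombie_detection_interval  # seconds between detections
        self._last_run = 0.0
        self._zombies = []

    def _refresh_zombies(self):
        # Only hit the DB once per interval, in the manager, not per subprocess.
        if time.monotonic() - self._last_run >= self._interval:
            self._zombies = self._find_zombies()
            self._last_run = time.monotonic()

    def run_once(self, file_paths, start_processor):
        self._refresh_zombies()
        for file_path in file_paths:
            # Every processor launched in this round sees the same zombie list.
            start_processor(file_path, self._zombies)
```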
Force-pushed from 5d42c76 to 3cd9375
Added the fix as a separate commit to keep a clear separation between the revert commit and the actual fix commit, and to avoid having two PRs in flight on the same topic. I can split the commits into a separate PR if needed, though.
tests/utils/test_dag_processing.py
Seems the test requires include_examples=True in order to load the test DAG below.
My bad, there was a local change that wasn't committed: the .py extension was missing from the file name. I was using the test DAG in tests/dags, so I wanted to avoid loading the examples, to cut down on redundant logging and parsing time. Ideally we would maintain one DAG file for each such test case to make it more of a unit test, but that convention might introduce too many files, so I decided to reuse the test DAG file for now.
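For illustration, this is roughly what loading the test DAGs without the examples looks like; the dag_folder path and dag_id below are just placeholders, not necessarily what the test in this PR uses.

```python
# Illustrative only: load DAGs from tests/dags and skip the bundled example
# DAGs to keep logging noise and parse time down.
from airflow.models import DagBag

dagbag = DagBag(dag_folder="tests/dags", include_examples=False)
test_dag = dagbag.get_dag("test_example_bash_operator")  # hypothetical dag_id
```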
Force-pushed from f93fcc9 to a879fdd
LGTM
@ashb @milton0825 PTAL
What have we decided to do about zombie detection? Is it okay as it is?
The 2nd commit should fix the zombie detection issue. I haven't been able to test it in our environment yet, and I can't promise I'll find time soon.
I don't really have the context anymore to review this usefully, but if Stefan thinks this is okay, let's go with it. We should get this into 1.10.6 too.
@KevinYang21 Can we please create a new Jira issue for this so that it is easier to understand in the changelog?
@kaxil Do you think I can just update AIRFLOW-4797? That way we have the full context in one Jira, end to end.
Hmm, on second thought, never mind. I think the PR is fine as it is and doesn't need a new Jira.
I was finally able to test it: zombie detection works great and is deterministic with the 2nd commit. Thanks @KevinYang21 for the change. Can you please resolve the conflicts?
Thanks @seelmann, sorry for the delay; the day job has been a bit crazy these days. Working on it now.
Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie detection (apache#5511)". This reverts commit 2bdb053.
Force-pushed from 4f6f31d to 7c5e882
Force-pushed from 7c5e882 to 4ad08ec
Probably worth doing this PR as a "Rebase and merge" rather than our normal "Squash and merge", so the revert and the zombie fix are kept as two separate commits, I think.
@ashb ya that's my plan ;)
Codecov Report
@@ Coverage Diff @@
## master #5908 +/- ##
==========================================
+ Coverage 80.35% 80.37% +0.01%
==========================================
Files 616 616
Lines 35739 35762 +23
==========================================
+ Hits 28719 28744 +25
+ Misses 7020 7018 -2
Continue to review full report at Codecov.
Cool. Working on cherry-picking/applying this to v1-10-test and then stable too.
@KevinYang21 @seelmann could you glance over these two commits and make sure I got them right:
Jira
Description
Original reason stated in the PR for why zombie detection was moved:
Zombie tasks will be calculated by the DAG parsing manager and sent to the DAG parsing processors to kill. This is to reduce DB CPU load (identified to produce 80% of the CPU load during a stress test; CPU usage went down from 80%+ to ~40% after this change). From [AIRFLOW-2760] Decouple DAG parsing loop from scheduler loop #3873.
I see no point in sending a query joining the two biggest tables on every DAG parse. Establishing new connections is much more expensive than sending one aggregated query, and it doesn't seem to deliver any immediate value: the DB load in a smaller cluster was not changing. And if we want to compare the running time difference, we should compare the aggregated query's running time against the old query's running time summed across all DAG file processors, not compare individual queries. We parse a couple of thousand files every 2 minutes, and that will generate heavy load on the DB, which I believe is the biggest bottleneck for Airflow scalability.
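A back-of-envelope illustration of the frequency argument above, using figures mentioned in this thread (a couple of thousand files per roughly 2-minute parse cycle, ~0.5 s average runtime for the aggregated query); the exact numbers are assumptions for illustration only.

```python
# Rough comparison of how many zombie queries hit the DB per parse cycle.
dag_files = 2000             # files parsed per cycle (assumed, from the thread)
parse_cycle_secs = 120       # roughly 2 minutes per full parse round
aggregated_query_secs = 0.5  # observed average for the single manager query

# Per-processor approach: every parsed file issues its own join query, plus
# the cost of opening a connection/transaction in each subprocess.
per_file_queries_per_cycle = dag_files
aggregated_queries_per_cycle = 1

print(f"per-file approach:   {per_file_queries_per_cycle} join queries per cycle")
print(f"aggregated approach: {aggregated_queries_per_cycle} query "
      f"(~{aggregated_query_secs}s) per cycle")
```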
Tests
Reverting PR
Commits
Documentation
Code Quality
flake8