Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de… #5908

KevinYang21 · 2019-08-24T17:18:20Z

Jira

My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
- https://issues.apache.org/jira/browse/AIRFLOW-XXX
- In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
- In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

Here are some details about my PR, including screenshots of any UI changes:
Original reason stated in the PR why zombie detection was moved
Zombie tasks will be calculate by DAG parsing manager and send to DAG parsing processor to kill. This is to reduce DB CPU load( identified to produce 80% of CPU load during stress test, CPU usage went down from 80%+ to ~40% after this change).
from [AIRFLOW-2760] Decouple DAG parsing loop from scheduler loop #3873.

I see no point sending a query joining two biggest tables in every DAG parsing. Establishing new connections is much more expensive than sending an aggregated query. It doesn't seem to deliver any immediate value: the DB load in a smaller cluster was not changing. And if we want to compare the running time diff we compare the aggregated query running time on all DAG file processors vs. the old query running time instead of compare the individual query. We parse a couple thoudsand files in 2 mins and it will generate heavy load to the DB, which I believe is the biggest bottelneck of Airflow scalibility.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
Reverting PR

Commits

My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain docstrings that explain what it does
- If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

Passes flake8

KevinYang21 · 2019-08-24T17:19:25Z

A lot conflict that I need to clean up. Please review after I removed the WIP tag and the CI passed.

tooptoop4 · 2019-08-24T17:39:39Z

AIRFLOW-4797 jira is still in done state?

KevinYang21 · 2019-08-26T21:59:02Z

@tooptoop4 I don't have perm to edit the JIRA but if the state of it needs to happen I think it should happen after this PR is merged?

KevinYang21 · 2019-08-29T01:29:49Z

@ashb @mik-laj @milton0825 @kaxil It's ready, PTAL

CC: @seelmann

KevinYang21 · 2019-09-04T19:49:14Z

add @potiuk @saguziel

milton0825 · 2019-09-04T20:19:39Z

Do we have some numbers around the running time of the old aggregated query and the new find zombie query.

Also, the join query can substantially slow down the parser process.

ashb · 2019-09-05T11:17:52Z

@KevinYang21 The original PR says

The query is faster than the previous one (see also stats below).

Tested on our staging environment (patch applied to Airflow 1.10.3), zombie detection works fine, database load is unchanged. Here some stats from pg_stat_statements, the branch run there for 4 hours: The new query (1st line) is faster but is likely called more frequently. The 2nd line shows stats of the old query.

select calls,mean_time,max_time,rows from pg_stat_statements where query like '%task_instance JOIN job%' and query like '%latest_heartbeat%';
  calls   |     mean_time      |  max_time   | rows 
----------+--------------------+-------------+------
    55416 | 0.0260821553522449 |    5.509762 |   29
 71969011 |  0.575755060854888 | 1078.895322 | 2377

Do you get different performance behaviour?

KevinYang21 · 2019-09-06T04:08:56Z

Thank you guys for reviewing!

@milton0825 We benchmarked the two approaches during the initial PR 3873 with 4k DAG files and 30k. With aggregated query the DB CPU usage is kept under 50% while with the subprocess query the DB will be killed instantly. In our production cluster at that time, running ~20k tasks concurrently with 2k DAG files, DB CPU went from 80% to ~40%. In our current production DB with >23M rows in task_instance table and >4M rows in job table, average time it takes to run the query takes 0.5 second( we have a powerful DB but the PR being reverted also showed an average of 0.5 second runtime of that query). So it shouldn't slow down the dag processor manager too much.

@ashb pg_stat won't get flushed until the DB is restarted so we don't really see the diff in frequency, but that is pretty important in the evaluation here. Even with the provided data, query time of 25 DAG files added would already beat the joined query, not to mention the overhead of starting/stopping the transaction.

In general I believe it is better to use the aggregated query, thus leverage the query optimizer, instead of trying to optimize ourselves. And esp. with a large scaled cluster that has huge number of DAG files to parse, it would be a show stopper if we distribute the query to the subprocess.

seelmann · 2019-09-06T05:12:38Z

Just reverting it would reintroduce the zombie detection problem (see Jira and discussion in the inital PR #5420). I'm ok with reverting it if it increases DB load in case you have 1000s of DAGS (we only have 20+). But then we need to find another way to fix the zombie detection.

Why is zombie detection at all part of DAG file processing? Can't there be a separate background thread that checks e.g. once per minute (configurable) for zombie tasks?

tooptoop4 · 2019-09-06T05:40:19Z

@KevinYang21 what is the underlying sql statement being run in old vs new?

KevinYang21 · 2019-09-06T19:08:31Z

@seelmann For sure, mb missed the issue described in the JIRA. Basically wrong scope zombie detection logic is placed inside the dag processor manager loop. A straightforward fix can be to extract the logic from heartbeat method to the the outter scope and then send the same version of zombies until the next zombie detection. If the subprocess zombie detection is really preferred, we can later bind the zombie detection logic with dag dir refresh logic to make sure that we send the same version of zombies to the dag file processors in the same round. I'll just go ahead and apply the straightforward fix. But I'm open to any other approach addressing the original zombie detection problem. Then later we can optimize the logic if desired. Does it make sense?

KevinYang21 · 2019-09-07T11:08:57Z

Added the fix as another commit to make clear seperation on the revert commit and the actual fix commit. Thus avoid having two async PRs on the same topic. I can seperate the commits into a seperate PR if it's needed tho.

seelmann · 2019-09-07T13:47:30Z

tests/utils/test_dag_processing.py

Seems test requires include_examples=True in order to load the test DAG below.

MB there, was a local change wasn't committed: missing .py in the file name. I was using the test DAG in tests/dags so wanna prevent loading examples--for less redundant logging and parsing time. I guess ideally we wanna to maintain one DAG file for each such test case to make it more unit test but that convention might introduce too many files so decided to reuse the test DAG file for now.

airflow/utils/dag_processing.py

seelmann · 2019-09-09T20:01:20Z

LGTM

KevinYang21 · 2019-09-12T00:26:50Z

@ashb @milton0825 PTAL

ashb · 2019-09-18T12:06:17Z

What have we decided to do about zombie detection? Is it okay as it is?

seelmann · 2019-09-18T18:07:55Z

The 2nd commit should fix the zombie detection issue. I was not able yet to test it in our environment, cannot promise if I can find time soon.

ashb · 2019-09-19T10:38:12Z

I don't really have the context anymore to review this usefully, but if Stefan thinks this is okay lets go with it. We should get this in for 1.10.6 too

kaxil · 2019-09-19T10:41:04Z

@KevinYang21 Can we please create a new Jira issue for this so that it is easier to understand in Changelog

KevinYang21 · 2019-09-21T01:19:11Z

@kaxil Do you think I can just update 4797? So that we have full context in one jira end to end.

kaxil · 2019-09-21T21:04:21Z

Hmm, rethinking it. Nvm I think the PR is fine as it, doesn't need new Jira.

seelmann · 2019-10-04T14:08:19Z

I finally was able to test, zombie detection works great and deterministic with the 2nd commit. Thanks @KevinYang21 for the change, can you please solve the conflicts?

KevinYang21 · 2019-10-10T22:25:47Z

thanks @seelmann, sry for the delay, day time job has been a bit crazy these days. Working on it now.

…tection (apache#5511)" This reverts commit 2bdb053.

ashb · 2019-10-14T20:50:13Z

Probably worth doing this PR as a "Rebase and merge" rather than our normal "Squash and merge" so the revert and the fix zombie are maintained as two separate commits I think

KevinYang21 · 2019-10-14T21:59:03Z

@ashb ya that's my plan ;)

codecov-io · 2019-10-14T22:08:51Z

Codecov Report

Merging #5908 into master will increase coverage by 0.01%.
The diff coverage is 89.13%.

@@            Coverage Diff             @@
##           master    #5908      +/-   ##
==========================================
+ Coverage   80.35%   80.37%   +0.01%     
==========================================
  Files         616      616              
  Lines       35739    35762      +23     
==========================================
+ Hits        28719    28744      +25     
+ Misses       7020     7018       -2

Impacted Files	Coverage Δ
airflow/models/dagbag.py	`92.41% <100%> (+0.03%)`	⬆️
airflow/jobs/scheduler_job.py	`74.55% <66.66%> (+0.03%)`	⬆️
airflow/utils/dag_processing.py	`58.56% <86.95%> (+2.01%)`	⬆️
airflow/contrib/operators/ssh_operator.py	`82.5% <0%> (-1.25%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1a4c164...4ad08ec. Read the comment docs.

ashb · 2019-10-15T10:19:06Z

Cool. Working on cherry-picking/applying this to v1-10-test then stable too

ashb · 2019-10-15T11:21:56Z

@KevinYang21 @seelmann could you glance over these two commits make sure I got them right:

8af58da01
b0ec8716f

KevinYang21 force-pushed the revert_4797 branch 7 times, most recently from bad59a8 to 5d42c76 Compare August 28, 2019 23:17

KevinYang21 changed the title ~~[WIP]Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de…~~ Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de… Aug 29, 2019

KevinYang21 requested review from BasPH, XD-DENG, ashb, feng-tao, kaxil, milton0825 and potiuk September 4, 2019 19:53

KevinYang21 force-pushed the revert_4797 branch from 5d42c76 to 3cd9375 Compare September 7, 2019 01:05

seelmann reviewed Sep 7, 2019

View reviewed changes

airflow/utils/dag_processing.py Outdated Show resolved Hide resolved

KevinYang21 force-pushed the revert_4797 branch 2 times, most recently from f93fcc9 to a879fdd Compare September 8, 2019 08:20

kaxil force-pushed the master branch from 558f9f1 to d719e1f Compare October 2, 2019 15:30

ashb approved these changes Oct 11, 2019

View reviewed changes

Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de…

19c8c59

…tection (apache#5511)" This reverts commit 2bdb053.

KevinYang21 force-pushed the revert_4797 branch from 4f6f31d to 7c5e882 Compare October 14, 2019 17:17

milton0825 approved these changes Oct 14, 2019

View reviewed changes

[AIRFLOW-4797] Use same zombies in all DAG file processors

4ad08ec

KevinYang21 force-pushed the revert_4797 branch from 7c5e882 to 4ad08ec Compare October 14, 2019 20:30

KevinYang21 merged commit cb0dbe3 into apache:master Oct 14, 2019

Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de… #5908

Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de… #5908

Uh oh!

Conversation

KevinYang21 commented Aug 24, 2019

Jira

Description

Tests

Commits

Documentation

Code Quality

Uh oh!

KevinYang21 commented Aug 24, 2019

Uh oh!

tooptoop4 commented Aug 24, 2019

Uh oh!

KevinYang21 commented Aug 26, 2019

Uh oh!

KevinYang21 commented Aug 29, 2019

Uh oh!

KevinYang21 commented Sep 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milton0825 commented Sep 4, 2019

Uh oh!

ashb commented Sep 5, 2019

Uh oh!

KevinYang21 commented Sep 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seelmann commented Sep 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tooptoop4 commented Sep 6, 2019

Uh oh!

KevinYang21 commented Sep 6, 2019

Uh oh!

KevinYang21 commented Sep 7, 2019

Uh oh!

seelmann Sep 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KevinYang21 Sep 7, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

seelmann commented Sep 9, 2019

Uh oh!

KevinYang21 commented Sep 12, 2019

Uh oh!

ashb commented Sep 18, 2019

Uh oh!

seelmann commented Sep 18, 2019

Uh oh!

ashb commented Sep 19, 2019

Uh oh!

kaxil commented Sep 19, 2019

Uh oh!

KevinYang21 commented Sep 21, 2019

Uh oh!

kaxil commented Sep 21, 2019

Uh oh!

seelmann commented Oct 4, 2019

Uh oh!

KevinYang21 commented Oct 10, 2019

Uh oh!

ashb commented Oct 14, 2019

Uh oh!

KevinYang21 commented Oct 14, 2019

Uh oh!

codecov-io commented Oct 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ashb commented Oct 15, 2019

Uh oh!

ashb commented Oct 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KevinYang21 commented Sep 4, 2019 •

edited

Loading

KevinYang21 commented Sep 6, 2019 •

edited

Loading

seelmann commented Sep 6, 2019 •

edited

Loading

seelmann Sep 7, 2019 •

edited

Loading

codecov-io commented Oct 14, 2019 •

edited

Loading

ashb commented Oct 15, 2019 •

edited

Loading