Skip to content

Conversation

@KevinYang21
Copy link
Member

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Original reason stated in the PR why zombie detection was moved
    Zombie tasks will be calculate by DAG parsing manager and send to DAG parsing processor to kill. This is to reduce DB CPU load( identified to produce 80% of CPU load during stress test, CPU usage went down from 80%+ to ~40% after this change).
    from [AIRFLOW-2760] Decouple DAG parsing loop from scheduler loop #3873.

I see no point sending a query joining two biggest tables in every DAG parsing. Establishing new connections is much more expensive than sending an aggregated query. It doesn't seem to deliver any immediate value: the DB load in a smaller cluster was not changing. And if we want to compare the running time diff we compare the aggregated query running time on all DAG file processors vs. the old query running time instead of compare the individual query. We parse a couple thoudsand files in 2 mins and it will generate heavy load to the DB, which I believe is the biggest bottelneck of Airflow scalibility.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Reverting PR

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@KevinYang21
Copy link
Member Author

A lot conflict that I need to clean up. Please review after I removed the WIP tag and the CI passed.

@tooptoop4
Copy link
Contributor

AIRFLOW-4797 jira is still in done state?

@KevinYang21
Copy link
Member Author

@tooptoop4 I don't have perm to edit the JIRA but if the state of it needs to happen I think it should happen after this PR is merged?

@KevinYang21 KevinYang21 force-pushed the revert_4797 branch 7 times, most recently from bad59a8 to 5d42c76 Compare August 28, 2019 23:17
@KevinYang21 KevinYang21 changed the title [WIP]Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de… Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de… Aug 29, 2019
@KevinYang21
Copy link
Member Author

@ashb @mik-laj @milton0825 @kaxil It's ready, PTAL

CC: @seelmann

@KevinYang21
Copy link
Member Author

KevinYang21 commented Sep 4, 2019

add @potiuk @saguziel

@milton0825
Copy link
Contributor

Do we have some numbers around the running time of the old aggregated query and the new find zombie query.

Also, the join query can substantially slow down the parser process.

@ashb
Copy link
Member

ashb commented Sep 5, 2019

@KevinYang21 The original PR says

The query is faster than the previous one (see also stats below).

Tested on our staging environment (patch applied to Airflow 1.10.3), zombie detection works fine, database load is unchanged. Here some stats from pg_stat_statements, the branch run there for 4 hours: The new query (1st line) is faster but is likely called more frequently. The 2nd line shows stats of the old query.

select calls,mean_time,max_time,rows from pg_stat_statements where query like '%task_instance JOIN job%' and query like '%latest_heartbeat%';
  calls   |     mean_time      |  max_time   | rows 
----------+--------------------+-------------+------
    55416 | 0.0260821553522449 |    5.509762 |   29
 71969011 |  0.575755060854888 | 1078.895322 | 2377

Do you get different performance behaviour?

@KevinYang21
Copy link
Member Author

KevinYang21 commented Sep 6, 2019

Thank you guys for reviewing!

@milton0825 We benchmarked the two approaches during the initial PR 3873 with 4k DAG files and 30k. With aggregated query the DB CPU usage is kept under 50% while with the subprocess query the DB will be killed instantly. In our production cluster at that time, running ~20k tasks concurrently with 2k DAG files, DB CPU went from 80% to ~40%. In our current production DB with >23M rows in task_instance table and >4M rows in job table, average time it takes to run the query takes 0.5 second( we have a powerful DB but the PR being reverted also showed an average of 0.5 second runtime of that query). So it shouldn't slow down the dag processor manager too much.

@ashb pg_stat won't get flushed until the DB is restarted so we don't really see the diff in frequency, but that is pretty important in the evaluation here. Even with the provided data, query time of 25 DAG files added would already beat the joined query, not to mention the overhead of starting/stopping the transaction.

In general I believe it is better to use the aggregated query, thus leverage the query optimizer, instead of trying to optimize ourselves. And esp. with a large scaled cluster that has huge number of DAG files to parse, it would be a show stopper if we distribute the query to the subprocess.

@seelmann
Copy link
Member

seelmann commented Sep 6, 2019

Just reverting it would reintroduce the zombie detection problem (see Jira and discussion in the inital PR #5420). I'm ok with reverting it if it increases DB load in case you have 1000s of DAGS (we only have 20+). But then we need to find another way to fix the zombie detection.

Why is zombie detection at all part of DAG file processing? Can't there be a separate background thread that checks e.g. once per minute (configurable) for zombie tasks?

@tooptoop4
Copy link
Contributor

@KevinYang21 what is the underlying sql statement being run in old vs new?

@KevinYang21
Copy link
Member Author

@seelmann For sure, mb missed the issue described in the JIRA. Basically wrong scope zombie detection logic is placed inside the dag processor manager loop. A straightforward fix can be to extract the logic from heartbeat method to the the outter scope and then send the same version of zombies until the next zombie detection. If the subprocess zombie detection is really preferred, we can later bind the zombie detection logic with dag dir refresh logic to make sure that we send the same version of zombies to the dag file processors in the same round. I'll just go ahead and apply the straightforward fix. But I'm open to any other approach addressing the original zombie detection problem. Then later we can optimize the logic if desired. Does it make sense?

@KevinYang21
Copy link
Member Author

Added the fix as another commit to make clear seperation on the revert commit and the actual fix commit. Thus avoid having two async PRs on the same topic. I can seperate the commits into a seperate PR if it's needed tho.

Copy link
Member

@seelmann seelmann Sep 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems test requires include_examples=True in order to load the test DAG below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MB there, was a local change wasn't committed: missing .py in the file name. I was using the test DAG in tests/dags so wanna prevent loading examples--for less redundant logging and parsing time. I guess ideally we wanna to maintain one DAG file for each such test case to make it more unit test but that convention might introduce too many files so decided to reuse the test DAG file for now.

@KevinYang21 KevinYang21 force-pushed the revert_4797 branch 2 times, most recently from f93fcc9 to a879fdd Compare September 8, 2019 08:20
@seelmann
Copy link
Member

seelmann commented Sep 9, 2019

LGTM

@KevinYang21
Copy link
Member Author

@ashb @milton0825 PTAL

@ashb
Copy link
Member

ashb commented Sep 18, 2019

What have we decided to do about zombie detection? Is it okay as it is?

@seelmann
Copy link
Member

The 2nd commit should fix the zombie detection issue. I was not able yet to test it in our environment, cannot promise if I can find time soon.

@ashb
Copy link
Member

ashb commented Sep 19, 2019

I don't really have the context anymore to review this usefully, but if Stefan thinks this is okay lets go with it. We should get this in for 1.10.6 too

@kaxil
Copy link
Member

kaxil commented Sep 19, 2019

@KevinYang21 Can we please create a new Jira issue for this so that it is easier to understand in Changelog

@KevinYang21
Copy link
Member Author

@kaxil Do you think I can just update 4797? So that we have full context in one jira end to end.

@kaxil
Copy link
Member

kaxil commented Sep 21, 2019

Hmm, rethinking it. Nvm I think the PR is fine as it, doesn't need new Jira.

@seelmann
Copy link
Member

seelmann commented Oct 4, 2019

I finally was able to test, zombie detection works great and deterministic with the 2nd commit. Thanks @KevinYang21 for the change, can you please solve the conflicts?

@KevinYang21
Copy link
Member Author

thanks @seelmann, sry for the delay, day time job has been a bit crazy these days. Working on it now.

@ashb
Copy link
Member

ashb commented Oct 14, 2019

Probably worth doing this PR as a "Rebase and merge" rather than our normal "Squash and merge" so the revert and the fix zombie are maintained as two separate commits I think

@KevinYang21
Copy link
Member Author

@ashb ya that's my plan ;)

@codecov-io
Copy link

codecov-io commented Oct 14, 2019

Codecov Report

Merging #5908 into master will increase coverage by 0.01%.
The diff coverage is 89.13%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5908      +/-   ##
==========================================
+ Coverage   80.35%   80.37%   +0.01%     
==========================================
  Files         616      616              
  Lines       35739    35762      +23     
==========================================
+ Hits        28719    28744      +25     
+ Misses       7020     7018       -2
Impacted Files Coverage Δ
airflow/models/dagbag.py 92.41% <100%> (+0.03%) ⬆️
airflow/jobs/scheduler_job.py 74.55% <66.66%> (+0.03%) ⬆️
airflow/utils/dag_processing.py 58.56% <86.95%> (+2.01%) ⬆️
airflow/contrib/operators/ssh_operator.py 82.5% <0%> (-1.25%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1a4c164...4ad08ec. Read the comment docs.

@KevinYang21 KevinYang21 merged commit cb0dbe3 into apache:master Oct 14, 2019
@ashb
Copy link
Member

ashb commented Oct 15, 2019

Cool. Working on cherry-picking/applying this to v1-10-test then stable too

@ashb
Copy link
Member

ashb commented Oct 15, 2019

@KevinYang21 @seelmann could you glance over these two commits make sure I got them right:

8af58da01
b0ec8716f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants