preload airflow imports before dag parsing to save time #30495
Conversation
I like the approach. A few things:
But finally, and more importantly:
Sure, would be happy to do so. I kinda lack a "representative" set of dags, though. Maybe I can see if I can use the example dags.
Yes, I was partly aware that parsing Python by hand was hacky terrain, but thinking about it, I didn't really see examples that would make this code fail. Do you have an actual valid Python code example in mind?
Yes and no. Yes, importing everything upfront would probably work (not tested), but it's quite a lot; I don't know if there are other drawbacks to it?
I thought of two realistic situations that might trip up your current approach (not sure if it's worth handling, but thought I'd mention them). First, when there's an example import inside a multi-line string:

```python
"""
<doc string info>
Example DAG:
import airflow...
<more doc string info>
"""
<real DAG>
```

Then, when imports are chosen based on runtime parameters such as environment variables:

```python
if prod:
    import airflow.foo
else:
    import airflow.bar
```
Ah yes, good examples. I switched to `ast.parse()` on Jarek's suggestion; it should handle this well. I'll add the multi-line string to the test file. About the if/else, I don't really know how we should handle it. Either we pre-import both branches, or we import none. I'd lean towards importing none, because this is just an optimization, and we want to do the 10% effort that handles 90% of the cases, I'd say?
Personally I agree; both solutions have pros and cons, but I would err on the side of not importing because that's safer, e.g. it could be behind an if statement because the import has side effects.
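For illustration, a minimal sketch of what the AST-based extraction could look like (an assumed implementation, not the PR's actual code; the function name is hypothetical). Iterating only over top-level statements means imports inside strings are never seen, and imports nested under `if`/`else` are skipped, matching the "import none" choice discussed above:

```python
import ast
from pathlib import Path
from typing import Iterator


def iter_top_level_airflow_imports(file_path: str) -> Iterator[str]:
    """Yield ``airflow.*`` module names imported at a file's top level.

    Imports that only appear inside string literals/docstrings are not
    AST ``Import`` nodes, so they are skipped naturally. Iterating over
    ``tree.body`` (rather than ``ast.walk``) also skips imports nested
    under ``if``/``try``, erring on the side of not pre-importing.
    """
    tree = ast.parse(Path(file_path).read_text())
    for node in tree.body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.startswith("airflow"):
                    yield alias.name
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.startswith("airflow"):
                yield node.module
```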
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Added some benchmark results in the description.
Force-pushed from 2b3c271 to 4303a19.
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
Yeah. With AST it's much more robust. And after some thinking, importing all airflow modules might be excessive.
I think it is a good change and I am good with it as is, but I have a nit/suggestion. I would be slightly more cautious about this one: while it's OK to have it on by default, maybe we should provide an escape hatch, a configuration parameter that disables this behaviour?
I can think of some scenarios where one of the ~650 (slightly misbehaving) libraries used by some of our providers might cache some information when imported. This change alters the characteristics of that: instead of being imported every time we parse the DAG, the library will be imported exactly once. Of course, this is what we want, but there might be side effects of it that we do not realise now, and giving the user a flag they can flip to come back to the old behaviour might be a good idea.
Just my 2c.
Yes, good idea, it'll offer an easy way out in case this causes trouble. I'm not sure which section I should put it in.
I think `scheduler` is better.
And merged |
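For reference, a hedged sketch of how the escape hatch could be consulted; the section and key names are inferred from the `AIRFLOW__SCHEDULER__PARSING_PRE_IMPORT_MODULES` env var cited below, and the module list is illustrative:

```python
import importlib

from airflow.configuration import conf

# Inferred setting: [scheduler] parsing_pre_import_modules, on by default.
# Flipping it to False would skip pre-importing entirely, restoring the
# old behaviour where each parsing subprocess imports everything itself.
if conf.getboolean("scheduler", "parsing_pre_import_modules", fallback=True):
    for module_name in ("airflow.models", "airflow.operators.bash"):
        importlib.import_module(module_name)  # populate the parent's sys.modules
```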
--------- Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com> Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com> (cherry picked from commit 9fab11c)
A problem was introduced in Airflow 2.6.0. Due to [PR30495](apache/airflow#30495) in Airflow, if any binary files happen to be in the `dags` folder, the dag processor manager will constantly crash, resulting in no DAGs being properly parsed. It's caused by some over-eager parsing of all files, regardless of type. The working fix is to ensure that these binary files are in the .airflowignore file, as mentioned [here](apache/airflow#31519). A different fix was supposed to land in 2.6.1, but having tested this, it does not seem to work (yet). (apache/airflow#31401)
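A hedged example of that workaround (in Airflow 2 the `.airflowignore` patterns are treated as regular expressions by default; the extensions below are illustrative, not a complete list):

```
# .airflowignore in the dags folder: keep the processor away from binaries
.*\.bin
.*\.so
```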
Airflow 2.6.0 introduced the `AIRFLOW__SCHEDULER__PARSING_PRE_IMPORT_MODULES` setting (apache#30495) to pre-import commonly used modules in the DAG File Processor parent process before forking, which mitigated this performance issue. This optimization logic and the corresponding setting were not carried over to Airflow 3 and the setting is currently ignored (as noted in apache#49839). While apache#49839 proposed removing the setting as unused, this commit re-implements the underlying pre-import optimization functionality to restore this performance benefit in Airflow 3. This helps to reduce the cumulative time spent on imports during serial DAG file parsing. Refs apache#50348 Addresses discussion in apache#49839 Refs apache#30495
Imports are evaluated each time we parse a dag, because parsing is done in a separate process: the module cache is not shared, and it is lost when the process is destroyed after having parsed the dag.
Importing user dependencies in a separate process is good because it provides isolation, but airflow dependencies are something we can pre-import to save time.
By doing it before we fork, we only have to do it once; it's then in the cache of all child processes.
I'm proposing this code, where I read the Python file, extract the imports that concern airflow modules, and import the result using importlib in the processor, just before we spawn the process that's going to execute the dag file.
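A minimal, self-contained sketch of the fork-and-inherit idea (the module list and the timing harness are illustrative, not the PR's code; it assumes a Unix system with `airflow` installed):

```python
import importlib
import os
import time

# Illustrative list; the PR derives it per-file from the extracted imports.
PRE_IMPORTS = ["airflow", "airflow.models"]

for name in PRE_IMPORTS:
    try:
        importlib.import_module(name)  # populate sys.modules in the parent
    except ImportError:
        pass  # best effort: the child will import it (slowly) as before

pid = os.fork()
if pid == 0:
    # Child: sys.modules was copied on fork, so this import is a cache hit
    # instead of a disk read plus top-level module execution.
    start = time.monotonic()
    import airflow.models  # noqa: F401
    print(f"child import took {time.monotonic() - start:.6f}s")
    os._exit(0)
os.waitpid(pid, 0)
```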
Benchmarks
For simple dags that just define a couple of operators and plug them together, this showed a ~60% reduction in the time it takes to process a dag file, from around 300ms to around 100ms on my machine.
Running it on a bunch of the example dags we have, I get the following (more diverse) numbers:

Method: I ran Airflow in Breeze on my machine; the times for main are an average of 11 parsing passes, and the times for this code are an average of 25 parsing passes. I used the number that is sent for the `dag_processing.last_duration.{file_name}` metric.

`__init__` is the file at `tests/system/providers/amazon/aws/utils/__init__.py`, required for the AWS example dags to work properly. It doesn't contain dags, but it still passes the heuristic (which is good in this case, otherwise it'd hide some imports).

Why some files get more of a boost than others, I don't really know. It could be that they do a lot in top-level code, so the imports are relatively small in comparison (this is the case for example_s3 and example_sagemaker, which show a good absolute improvement, but not so much in %). It could be that the operators they import didn't pull in that many dependencies, so they weren't adding much drag (probably the case for example_http?). Or maybe there are other reasons.

The good news is that this shows improvements across the board!