Conversation

@vandonr-amz (Contributor) commented Apr 5, 2023

Imports are evaluated each time we parse a dag because parsing is done in a separate process, so the module cache is not shared (and is lost when the process is destroyed after having parsed the dag).
Importing user dependencies in a separate process is good because it provides isolation, but airflow dependencies are something we can pre-import to save time.
By doing the imports before we fork, we only have to do them once; they are then already in the module cache of every child process.

I'm proposing this change: I read the Python file, extract the imports that concern airflow modules, and import them using importlib in the processor, just before we spawn the process that's going to execute the dag file.
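To illustrate why pre-importing before the fork helps, here is a minimal sketch of the mechanism (not the code in this PR; it assumes airflow is installed and a Unix-style fork):

import importlib
import os
import sys

# The parent process (the DAG file processor) imports a heavy airflow module once...
importlib.import_module("airflow.models")

pid = os.fork()
if pid == 0:
    # ...and the forked child inherits the parent's sys.modules, so the same
    # import is now a cheap cache lookup instead of a real (slow) import.
    assert "airflow.models" in sys.modules
    os._exit(0)
os.waitpid(pid, 0)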


Benchmarks

For simple dags that just define a couple of operators and plug them together, this showed a ~60% reduction in the time it takes to process a dag file, from around 300ms to around 100ms on my machine.

Running it on a bunch of the example dags we have, I get the following (more diverse) numbers:
[screenshot: table of per-file parsing times on main vs. this branch]
Method: I ran airflow in breeze on my machine; the times for main are an average of 11 parsing passes, and the times for this code are an average of 25 parsing passes. I used the number that is sent for the dag_processing.last_duration.{file_name} metric.
__init__ is the file at tests/system/providers/amazon/aws/utils/__init__.py, required for the aws example dags to work properly. It doesn't contain dags, but it still passes the heuristic (which is good in this case, otherwise we'd miss some of its imports).

Why some files get more of a boost than others, I don't really know. It can be that they do a lot in top-level code, so the imports are relatively small in comparison (this is the case for example_s3 and example_sagemaker, which show a good absolute improvement, but not so much in %). It can be that the operators they import don't pull in that many dependencies, so they weren't adding much drag in the first place (probably the case for example_http?). Or maybe other reasons.
The good news is that this shows improvements across the board!

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Apr 5, 2023
@potiuk (Member) commented Apr 6, 2023

I like the approach. A few things:

  1. I think we will need some more detailed benchmarks
  2. I think there is a flaw in just reading lines from Python code; there are - theoretically at least - cases where imports are broken across multiple lines. Reading lines from Python sounds a bit hack-ish (AST parsing would be a bit slower but much better).

But finally, and more importantly:

  3. Wouldn't it be better to import all airflow packages upfront (excluding providers maybe)? That sounds like a much more robust solution; you only do it once at the start of the DAG file processor, and you don't even need to parse the DAG files for their imports at that point.

@vandonr-amz (Contributor, Author) commented:

1. I think we will need some more detailed benchmarks

Sure, I'd be happy to. I kinda lack a "representative" set of dags; maybe I can see if I can use the example dags.
I tested this mostly with breeze; are there other configurations you'd like to see in a benchmark?

2. I think there is a flaw in just reading lines from Python code; there are - theoretically at least - cases where imports are broken across multiple lines. Reading lines from Python sounds a bit hack-ish (AST parsing would be a bit slower but much better).

Yes, I was partly aware that parsing Python by hand is hacky terrain, but thinking about it, I couldn't really see examples that would make this code fail. Do you have an actual valid Python code example in mind?
I'll take a look at AST parsing, thanks for the pointer.

3. Wouldn't it be better to import all airflow packages upfront (excluding providers maybe)? That sounds like a much more robust solution; you only do it once at the start of the DAG file processor, and you don't even need to parse the DAG files for their imports at that point.

Yes and no. Yes, importing everything upfront would probably work (not tested), but it's quite a lot, and I don't know if there are other drawbacks to it.
And if we don't import provider code, then we lose a good chunk of the performance gains from this PR, because I imagine that as soon as users use operators, they'll import provider packages, which is slow too.
It'd still be better than before, but not as fast as it could be.

@notatallshaw-gts (Contributor) commented Apr 10, 2023

Yes, I was partly aware that parsing Python by hand is hacky terrain, but thinking about it, I couldn't really see examples that would make this code fail. Do you have an actual valid Python code example in mind?

I thought of two realistic situations that might trip up your current approach (not sure if they're worth handling, but I thought I'd mention them).

First, when there's an example import inside a multi-line string:

"""
<doc string info>
Example DAG:

import airflow...

<more doc string info>
"""

<real DAG>

Second, when imports are chosen based on runtime parameters such as environment variables:

if prod:
    import airflow.foo
else:
    import airflow.bar

@vandonr-amz (Contributor, Author) commented:

Ah yes, good examples. I switched to ast.parse() on Jarek's suggestion; it should handle these well. I'll add the multi-line string case to the test file.

About the if/else, I don't really know how we should handle it. Either we pre-import both branches, or we import neither. I'd lean towards importing neither, because this is just an optimization and we want the 10% of effort that handles 90% of the cases, I'd say.
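For illustration, here is a rough sketch of what the ast-based extraction could look like (simplified, with made-up helper names; not the exact code in this PR). Looking only at top-level statements means docstrings are ignored (they are plain string expressions, not Import nodes) and imports nested under if/else are deliberately skipped:

import ast
import importlib

def _top_level_airflow_imports(source):
    # Only inspect the module's top-level statements: anything nested under
    # if/else, try/except or function bodies is intentionally left alone.
    modules = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.append(node.module)
    return [m for m in modules if m == "airflow" or m.startswith("airflow.")]

def _pre_import_airflow_modules(dag_file):
    with open(dag_file) as f:
        source = f.read()
    for module in _top_level_airflow_imports(source):
        try:
            importlib.import_module(module)
        except Exception:
            # Best effort only: a genuinely broken import will surface when
            # the child process parses the DAG file for real.
            pass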

@notatallshaw-gts (Contributor) commented Apr 10, 2023

About the if/else, I don't really know how we should handle it. Either we pre-import both branches, or we import neither. I'd lean towards importing neither, because this is just an optimization and we want the 10% of effort that handles 90% of the cases, I'd say.

Personally I agree; both solutions have pros and cons, but I would err on the side of not importing because that's safer, e.g. an import could be behind an if statement precisely because it has side effects.

vandonr-amz and others added 2 commits April 11, 2023 09:53
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
@vandonr-amz (Contributor, Author) commented:

Added some benchmark results to the description.

@vandonr-amz vandonr-amz marked this pull request as ready for review April 11, 2023 20:10
vandonr-amz and others added 2 commits April 11, 2023 15:07
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
@potiuk (Member) left a comment:

Yeah. With AST it's much more robust. And after some thinking, importing all airflow modules upfront might be a bit too excessive.

I think it is a good change and I am good with it as is, but I have a nit/suggestion: I would be slightly more cautious about this one. While it's OK to have it on by default, maybe we should provide an escape hatch with a configuration parameter that disables this behaviour?

I can think of scenarios where one of the ~650 (slightly misbehaving) libraries used by some of our providers might cache some information when imported. This change alters that characteristic: instead of being imported every time we parse the DAG, the library will be imported exactly once. Of course, this is what we want, but there might be side effects we don't realise now, and giving the user a flag they can flip to go back to the old behaviour might be a good idea.

Just 2 c.

@vandonr-amz (Contributor, Author) commented:

maybe we should provide an escape hatch with a configuration parameter that disables this behaviour?

Yes, good idea; it'll offer an easy way out in case this causes trouble.

I'm not sure whether I should put it in the [core] section (with dag_file_processor_timeout for instance) or in the [scheduler] section (with file_parsing_sort_mode among others)...

@potiuk (Member) commented Apr 14, 2023

I think scheduler is better.
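For reference, the option ended up as parsing_pre_import_modules in the [scheduler] section (its environment-variable form appears in the commit messages referenced below), so the escape hatch looks roughly like this in airflow.cfg:

[scheduler]
# Set to False to disable the pre-import optimization (it is on by default).
parsing_pre_import_modules = False

or, equivalently, AIRFLOW__SCHEDULER__PARSING_PRE_IMPORT_MODULES=False as an environment variable.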

@potiuk potiuk added this to the Airflow 2.6.0 milestone Apr 14, 2023
@potiuk potiuk merged commit 9fab11c into apache:main Apr 14, 2023
@potiuk (Member) commented Apr 14, 2023

And merged

@vandonr-amz vandonr-amz deleted the vandonr/nice branch April 14, 2023 17:33
@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Apr 14, 2023
ephraimbuddy added a commit that referenced this pull request Apr 14, 2023
---------

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
(cherry picked from commit 9fab11c)
wookiist pushed a commit to wookiist/airflow that referenced this pull request Apr 19, 2023

---------

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
danielvdende added a commit to godatadriven/whirl that referenced this pull request May 26, 2023
A problem was introduced in Airflow 2.6.0. Due to [PR30495](apache/airflow#30495)
in Airflow, if any binary files happen to be in the `dags` folder,
the dag processor manager will constantly crash, resulting in
no DAGs being properly parsed. It's caused by some over-eager
parsing of all files, regardless of type.

The working fix is to ensure that these binary files are in the .airflowignore
file, as mentioned [here](apache/airflow#31519).

A different fix was supposed to land in 2.6.1, but having tested this,
it does not seem to work (yet). (apache/airflow#31401)
Lzzz666 added a commit to Lzzz666/airflow that referenced this pull request May 8, 2025
Airflow 2.6.0 introduced the `AIRFLOW__SCHEDULER__PARSING_PRE_IMPORT_MODULES` setting (apache#30495) to pre-import commonly used modules in the DAG File Processor parent process before forking, which mitigated this performance issue.

This optimization logic and the corresponding setting were not carried over to Airflow 3 and the setting is currently ignored (as noted in apache#49839). While apache#49839 proposed removing the setting as unused, this commit re-implements the underlying pre-import optimization functionality to restore this performance benefit in Airflow 3.

This helps to reduce the cumulative time spent on imports during serial DAG file parsing.

Refs apache#50348
Addresses discussion in apache#49839
Refs apache#30495
Lzzz666 added a commit to Lzzz666/airflow that referenced this pull request Jun 24, 2025