Conversation

@vandonr-amz (Contributor) commented Apr 5, 2023

Imports are evaluated each time we parse a dag because parsing is done in a separate process, so the module cache is not shared (and is lost when the process is destroyed after having parsed the dag).
Importing user dependencies in a separate process is good because it provides isolation, but airflow dependencies are something we can pre-import to save time.
By doing the imports before we fork, we only have to do them once; they are then already in the module cache of every child process.

I'm proposing this change: I read the Python file, extract the imports that concern airflow modules, and import them using importlib in the processor, just before we spawn the process that's going to execute the dag file.
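To illustrate why pre-importing before the fork helps, here is a minimal sketch of the mechanism (not the code in this PR; it assumes airflow is installed and a Unix-style fork):

import importlib
import os
import sys

# The parent process (the DAG file processor) imports a heavy airflow module once...
importlib.import_module("airflow.models")

pid = os.fork()
if pid == 0:
    # ...and the forked child inherits the parent's sys.modules, so the same
    # import is now a cheap cache lookup instead of a real (slow) import.
    assert "airflow.models" in sys.modules
    os._exit(0)
os.waitpid(pid, 0)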


Benchmarks

For simple dags that just define a couple of operators and plug them together, this showed a ~60% reduction in the time it takes to process a dag file, from around 300ms to around 100ms on my machine.

Running it on a bunch of the example dags we have, I get the following (more diverse) numbers:
[screenshot: table of per-file parsing times on main vs. this branch]
Method: I ran airflow in breeze on my machine; the times for main are an average of 11 parsing passes, and the times for this code are an average of 25 parsing passes. I used the number that is sent for the dag_processing.last_duration.{file_name} metric.
__init__ is the file at tests/system/providers/amazon/aws/utils/__init__.py, required for the aws example dags to work properly. It doesn't contain dags, but it still passes the heuristic (which is good in this case, otherwise we'd miss some of its imports).

Why some files get more of a boost than others, I don't really know. It can be that they do a lot in top-level code, so the imports are relatively small in comparison (this is the case for example_s3 and example_sagemaker, which show a good absolute improvement, but not so much in %). It can be that the operators they import don't pull in that many dependencies, so they weren't adding much drag in the first place (probably the case for example_http?). Or maybe other reasons.
The good news is that this shows improvements across the board!

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Apr 5, 2023
@potiuk (Member) commented Apr 6, 2023

I like the approach. A few things:

  1. I think we will need some more detailed benchmarks
  2. I think there is a flaw in just reading lines from Python code; there are - theoretically at least - cases where imports are broken across multiple lines. Reading lines from Python sounds a bit hack-ish (AST parsing would be a bit slower but much better).

But finally, and more importantly:

  3. Wouldn't it be better to import all airflow packages upfront (excluding providers maybe)? That sounds like a much more robust solution; you only do it once at the start of the DAG file processor, and you don't even need to parse the DAG files for their imports at that point.

@vandonr-amz (Contributor, Author) commented:

1. I think we will need some more detailed benchmarks

Sure, I'd be happy to. I kinda lack a "representative" set of dags; maybe I can see if I can use the example dags.
I tested this mostly with breeze; are there other configurations you'd like to see in a benchmark?

2. I think there is a flaw in just reading lines from Python code; there are - theoretically at least - cases where imports are broken across multiple lines. Reading lines from Python sounds a bit hack-ish (AST parsing would be a bit slower but much better).

Yes, I was partly aware that parsing Python by hand is hacky terrain, but thinking about it, I couldn't really see examples that would make this code fail. Do you have an actual valid Python code example in mind?
I'll take a look at AST parsing, thanks for the pointer.

3. Wouldn't it be better to import all airflow packages upfront (excluding providers maybe)? That sounds like a much more robust solution; you only do it once at the start of the DAG file processor, and you don't even need to parse the DAG files for their imports at that point.

Yes and no. Yes, importing everything upfront would probably work (not tested), but it's quite a lot, and I don't know if there are other drawbacks to it.
And if we don't import provider code, then we lose a good chunk of the performance gains from this PR, because I imagine that as soon as users use operators, they'll import provider packages, which is slow too.
It'd still be better than before, but not as fast as it could be.

@notatallshaw-gts (Contributor) commented Apr 10, 2023

Yes, I was partly aware that parsing Python by hand is hacky terrain, but thinking about it, I couldn't really see examples that would make this code fail. Do you have an actual valid Python code example in mind?

I thought of two realistic situations that might trip up your current approach (not sure if they're worth handling, but I thought I'd mention them).

First, when there's an example import inside a multi-line string:

"""
<doc string info>
Example DAG:

import airflow...

<more doc string info>
"""

<real DAG>

Second, when imports are chosen based on runtime parameters such as environment variables:

if prod:
    import airflow.foo
else:
    import airflow.bar

@vandonr-amz (Contributor, Author) commented:

Ah yes, good examples. I switched to ast.parse() on Jarek's suggestion; it should handle these well. I'll add the multi-line string case to the test file.

About the if/else, I don't really know how we should handle it. Either we pre-import both branches, or we import neither. I'd lean towards importing neither, because this is just an optimization and we want the 10% of effort that handles 90% of the cases, I'd say.
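For illustration, here is a rough sketch of what the ast-based extraction could look like (simplified, with made-up helper names; not the exact code in this PR). Looking only at top-level statements means docstrings are ignored (they are plain string expressions, not Import nodes) and imports nested under if/else are deliberately skipped:

import ast
import importlib

def _top_level_airflow_imports(source):
    # Only inspect the module's top-level statements: anything nested under
    # if/else, try/except or function bodies is intentionally left alone.
    modules = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.append(node.module)
    return [m for m in modules if m == "airflow" or m.startswith("airflow.")]

def _pre_import_airflow_modules(dag_file):
    with open(dag_file) as f:
        source = f.read()
    for module in _top_level_airflow_imports(source):
        try:
            importlib.import_module(module)
        except Exception:
            # Best effort only: a genuinely broken import will surface when
            # the child process parses the DAG file for real.
            pass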

@notatallshaw-gts (Contributor) commented Apr 10, 2023

About the if/else, I don't really know how we should handle it. Either we pre-import both branches, or we import neither. I'd lean towards importing neither, because this is just an optimization and we want the 10% of effort that handles 90% of the cases, I'd say.

Personally I agree; both solutions have pros and cons, but I would err on the side of not importing because that's safer, e.g. an import could be behind an if statement precisely because it has side effects.

vandonr-amz and others added 2 commits April 11, 2023 09:53
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
@vandonr-amz (Contributor, Author) commented:

Added some benchmark results to the description.

@vandonr-amz vandonr-amz marked this pull request as ready for review April 11, 2023 20:10
vandonr-amz and others added 2 commits April 11, 2023 15:07
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
@potiuk (Member) left a comment:

Yeah. With AST it's much more robust. And after some thinking, importing all airflow modules upfront might be a bit too excessive.

I think it is a good change and I am good with it as is, but I have a nit/suggestion: I would be slightly more cautious about this one. While it's OK to have it on by default, maybe we should provide an escape hatch with a configuration parameter that disables this behaviour?

I can think of scenarios where one of the ~650 (slightly misbehaving) libraries used by some of our providers might cache some information when imported. This change alters that characteristic: instead of being imported every time we parse the DAG, the library will be imported exactly once. Of course, this is what we want, but there might be side effects we don't realise now, and giving the user a flag they can flip to go back to the old behaviour might be a good idea.

Just 2 c.

@vandonr-amz (Contributor, Author) commented:

maybe we should provide an escape hatch with a configuration parameter that disables this behaviour?

Yes, good idea; it'll offer an easy way out in case this causes trouble.

I'm not sure whether I should put it in the [core] section (with dag_file_processor_timeout for instance) or in the [scheduler] section (with file_parsing_sort_mode among others)...

@potiuk (Member) commented Apr 14, 2023

I think scheduler is better.
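For reference, the option ended up as parsing_pre_import_modules in the [scheduler] section (its environment-variable form appears in the commit messages referenced below), so the escape hatch looks roughly like this in airflow.cfg:

[scheduler]
# Set to False to disable the pre-import optimization (it is on by default).
parsing_pre_import_modules = False

or, equivalently, AIRFLOW__SCHEDULER__PARSING_PRE_IMPORT_MODULES=False as an environment variable.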

@potiuk potiuk added this to the Airflow 2.6.0 milestone Apr 14, 2023
@potiuk potiuk merged commit 9fab11c into apache:main Apr 14, 2023
@potiuk (Member) commented Apr 14, 2023

And merged

@vandonr-amz vandonr-amz deleted the vandonr/nice branch April 14, 2023 17:33
@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Apr 14, 2023
ephraimbuddy added a commit that referenced this pull request Apr 14, 2023
---------

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
(cherry picked from commit 9fab11c)
wookiist pushed a commit to wookiist/airflow that referenced this pull request Apr 19, 2023

---------

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
danielvdende added a commit to godatadriven/whirl that referenced this pull request May 26, 2023
A problem was introduced in Airflow 2.6.0. Due to [PR30495](apache/airflow#30495)
in Airflow, if any binary files happen to be in the `dags` folder,
the dag processor manager will constantly crash, resulting in
no DAGs being properly parsed. It's caused by some over-eager
parsing of all files, regardless of type.

The working fix is to ensure that these binary files are in the .airflowignore
file, as mentioned [here](apache/airflow#31519).

A different fix was supposed to land in 2.6.1, but having tested this,
it does not seem to work (yet). (apache/airflow#31401)
Lzzz666 added a commit to Lzzz666/airflow that referenced this pull request May 8, 2025
Airflow 2.6.0 introduced the `AIRFLOW__SCHEDULER__PARSING_PRE_IMPORT_MODULES` setting (apache#30495) to pre-import commonly used modules in the DAG File Processor parent process before forking, which mitigated this performance issue.

This optimization logic and the corresponding setting were not carried over to Airflow 3 and the setting is currently ignored (as noted in apache#49839). While apache#49839 proposed removing the setting as unused, this commit re-implements the underlying pre-import optimization functionality to restore this performance benefit in Airflow 3.

This helps to reduce the cumulative time spent on imports during serial DAG file parsing.

Refs apache#50348
Addresses discussion in apache#49839
Refs apache#30495
Lzzz666 added a commit to Lzzz666/airflow that referenced this pull request Jun 24, 2025