Skip to content

Conversation

@uranusjr
Copy link
Member

importlib.metadata.distributions() reads information from the actual installations, which is a lot of IO that we can avoid by caching.

The benefit of this depends on how many packages you have in your installation. It’s nearly zero with a bare Airflow installation, and I observed a ~7% save (17s to 16s) for the webserver to finish init (launch until when_ready is emitted) in a setup with all official providers installed.

The downside is we are now persisting a lot of small objects in memory. I wonder whether there’s a good time we can purge those.

@uranusjr uranusjr force-pushed the cache-importlib-metadata branch 2 times, most recently from 55945e1 to b83d310 Compare February 20, 2023 10:35
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was profling memory usage (not speed), and this seemed to be a cause of a lot of bloat -- and it's suprisingly expensive to get the dist name (involves parsing a lot of files for every dist in airflow).

I think there are three options here:

  1. Only compute this key if the entrypoint group matches (this limits the expensive operation to just dists we actually care about, instead of all
  2. Use the path component to be the cache key (see below)
  3. Cache based on the entrypoint classpath
  4. Don't cache at all. This only catches a case when you have multiple copies of the same dist in multiple cases (which is rare to hit outside of being an Airflow developer anyway).

On point 2, it doesn't seem possible to do this using public methods, but this:

In [14]: d._path
Out[14]: PosixPath('/home/ash/airflow/.venv/lib/python3.11/site-packages/greenlet-2.0.2.dist-info')

In [15]: d._path.stem
Out[15]: 'greenlet-2.0.2'

Doing this might break for less common ways of shipping dists though, so it's likely not a good option, not without a fallback anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the callers and currently this is used by loading airflow.plugins and apache_airflow_provider. Both of these implement their own deduplication logic, so I think it is safe to remove this entirely. Although this would not actually help the latter case, which still accesses metadata anyway…

@uranusjr uranusjr requested a review from ashb February 21, 2023 05:34
@uranusjr uranusjr force-pushed the cache-importlib-metadata branch from 449de1f to 7ad6b88 Compare February 21, 2023 17:08
importlib.metadata.distributions() reads information from the actual
installations, which is a lot of IO that we can avoid by caching.
@uranusjr uranusjr force-pushed the cache-importlib-metadata branch from 7ad6b88 to ec0fa55 Compare February 22, 2023 05:29
@uranusjr uranusjr marked this pull request as ready for review February 22, 2023 05:29
@uranusjr uranusjr merged commit 9f51845 into apache:main Feb 25, 2023
@uranusjr uranusjr deleted the cache-importlib-metadata branch February 25, 2023 05:36
@pierrejeambrun pierrejeambrun added this to the Airflow 2.5.2 milestone Feb 27, 2023
@pierrejeambrun pierrejeambrun added the type:misc/internal Changelog: Misc changes that should appear in change log label Feb 27, 2023
pierrejeambrun pushed a commit that referenced this pull request Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type:misc/internal Changelog: Misc changes that should appear in change log

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants