
proposal: pex3 cache introspection/gc command #2201

@cosmicexplorer

Description


Discussed in https://github.com/pantsbuild/pex/discussions/2200

Originally posted by cosmicexplorer August 1, 2023
forked from a response to #2175 (comment):

There is an expense here in ~duplicating cached zips, and Pants / Pex are already both notorious amongst users for excessive cache sizes.
Without that, this feature definitely needs to be behind a flag (--i-opt-in-to-cache-doubling - clearly not spelled like that!). Now you already mentioned being behind a flag, so I think you're on board there.

Cache GC Policies

Generalizing this a bit, I recall that pantsd used to have a flag for how often it garbage collects the rust store--if there are concerns about the bloat of pex cache directories, are there any opportunities for pex itself to help the user automate the cache management outside of just rm -rf ~/.pex? What is currently the easiest way to implement e.g. LRU eviction? I guess I can do something like this?

> find ~/.pex -type f \( -atime '+30' -or \( -atime '+7' -size '+300M' \) \) -exec rm -f '{}' '+'

The above probably works, but I'm wondering if the dilemma about cache bloat that you describe is partially because the user isn't given enough tools to mediate it? Or am I misinterpreting you?
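The `find` heuristic above (evict anything unread for 30 days, or unread for 7 days and over 300 MiB) could also live inside pex itself. A minimal sketch in Python, assuming a hypothetical helper name and a dry-run style that returns candidates instead of deleting them:

```python
import os
import time

DAY = 86400  # seconds

def lru_eviction_candidates(root, now=None):
    """Return cache files matching the find(1) heuristic above:
    not accessed in 30 days, OR not accessed in 7 days AND > 300 MiB.

    This is a dry-run sketch: callers decide whether to actually unlink.
    """
    now = time.time() if now is None else now
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            idle = now - st.st_atime  # seconds since last access
            if idle > 30 * DAY or (idle > 7 * DAY and st.st_size > 300 * 2**20):
                candidates.append(path)
    return candidates
```

Note that this relies on atime being recorded at all; filesystems mounted `noatime` (or `relatime`, partially) would undercut both the `find` one-liner and this sketch, which is one argument for pex recording its own access metadata.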

Insight: evict cache entries based on usage frequency

In particular, one GC heuristic that pex (or pip) itself would be in the best place to record is not just how recently each cache entry was accessed, but how often. Something like this could be fun:

> pex3 cache evict -accessed '>30 days' -or \( -size '>300M' -accessed '<1 per 1 day' \)
448M    ~/.pex/stitched_dists/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
...
2.5GB to be removed. Delete? [Y/n] y
2.5GB deleted.

Does that sound like a fruitful thing to investigate further? Or are there better ways to address the disk usage pressure?
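Recording access *frequency* (not just recency) would need some bookkeeping on pex's side, since the filesystem only keeps a single atime. A minimal sketch of what that bookkeeping might look like, assuming a hypothetical JSON sidecar file and class name:

```python
import json
import os
import time

DAY = 86400  # seconds

class AccessLog:
    """Hypothetical sidecar recording how often each cache entry is hit,
    so eviction can consider frequency (e.g. '<1 per 1 day') as well as
    recency."""

    def __init__(self, path):
        self.path = path
        self.hits = {}  # entry name -> [first_seen_timestamp, hit_count]
        if os.path.exists(path):
            with open(path) as f:
                self.hits = json.load(f)

    def record(self, entry, now=None):
        """Bump the hit count for a cache entry."""
        now = time.time() if now is None else now
        first, count = self.hits.get(entry, [now, 0])
        self.hits[entry] = [first, count + 1]

    def rate_per_day(self, entry, now=None):
        """Average accesses per day since the entry was first seen."""
        now = time.time() if now is None else now
        first, count = self.hits.get(entry, [now, 0])
        span = max(now - first, DAY)  # avoid div-by-zero for new entries
        return count / (span / DAY)

    def save(self):
        with open(self.path, "w") as f:
            json.dump(self.hits, f)
```

An eviction pass could then combine `rate_per_day(entry) < 1` with a size threshold to implement the `-size '>300M' -accessed '<1 per 1 day'` predicate sketched above.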

Prior Art

Examples of this from other tools:

pip example

One useful bit of prior art is the new pip cache subcommand within pip (it's on the main branch, not sure which version it first appeared in):

> PYTHONPATH="$(pwd)/dist/pip-23.3.dev0-py3-none-any.whl:${PYTHONPATH}" python3.12 -m pip cache list --help

Usage:   
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache dir
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache info
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache list [<pattern>] [--format=[human, abspath]]
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache remove <pattern>
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache purge
  

Description:
  Inspect and manage pip's wheel cache.
  
  Subcommands:
  
  - dir: Show the cache directory.
  - info: Show information about the cache.
  - list: List filenames of packages stored in the cache.
  - remove: Remove one or more package from the cache.
  - purge: Remove all items from the cache.
  
  ``<pattern>`` can be a glob expression or a package name.

Cache Options:
  --format <list_format>      Select the output format among: human (default) or abspath
# ...
> PYTHONPATH="$(pwd)/dist/pip-23.3.dev0-py3-none-any.whl:${PYTHONPATH}" python3.12 -m pip cache list 'wheel*'
Cache contents:

 - wheel-0.40.0-py3-none-any.whl (64 kB)
 - wheel-0.40.0-py3-none-any.whl (64 kB)
 - wheel-0.41.0-py3-none-any.whl (64 kB)
 - wheel-0.41.0-py3-none-any.whl (64 kB)

spack comparison

I know spack users also have the same issue, but it's less pressing because:

  1. spack's filesystem usage is largely dominated by the contents of the packages it installs; because these do not come in the formats expected by standard package repositories, they are often so large that caching them the way pex or pants does would produce uncomfortable disk usage much more quickly.
  2. spack specs allow very powerful queries, which makes it easier to implement e.g. "uninstall all versions of emacs built without the tree-sitter library" (that looks like spack uninstall 'emacs~tree-sitter') or "uninstall anything compiled by a version of clang less than or equal to X.Y.Z, along with its transitive dependents" (that looks like spack uninstall --all '%clang@:X.Y.Z'), by deferring to the clingo ASP logic solver (e.g. https://github.com/spack/spack/blob/936c6045fc0686e683c6b3da20967d2e30a7ec87/lib/spack/spack/solver/concretize.lp#L7).

So spack users generally have the ability to very finely tune the tool's disk usage to suit their own immediate needs, and pruning or even seeding a cache e.g. for export to an internal environment is considered a top-level feature. While pex (and especially pex3) also make the creation of python environments a top-level feature, we currently aren't able to apply the same selection logic to prune our cache directories.

Insight: select cache entries to evict using our existing platform/interpreter selection logic

Along those lines, to expand on the proposed pex3 cache command, we could introduce platform selection logic:

> pex3 cache evict -platform 'linux'
448M    ~/.pex/stitched_dists/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
...
2.5GB to be removed. Delete? [Y/n] y
2.5GB deleted.
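For cached wheels, a first cut at platform selection doesn't even need pex's full tag-resolution machinery, since the platform tag is encoded in the wheel filename (`name-version[-build]-python-abi-platform.whl` per PEP 427). A minimal sketch, with hypothetical function names:

```python
def wheel_platform_tags(filename):
    """Extract the platform tag(s) from a wheel filename, e.g.
    'pkg-1.0-cp38-cp38-manylinux2010_x86_64.whl' -> ['manylinux2010_x86_64'].
    Compressed tag sets are dot-separated, so split on '.'."""
    stem = filename[: -len(".whl")]
    return stem.split("-")[-1].split(".")

def matches_platform(filename, substring):
    """Hypothetical predicate backing `pex3 cache evict -platform 'linux'`:
    does any platform tag contain the given substring?"""
    return any(substring in tag for tag in wheel_platform_tags(filename))
```

A real implementation would presumably reuse pex's existing tag/interpreter selection logic rather than substring matching, but the filename-driven approach shows why the cache itself contains enough information to support this kind of query.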
