Materialize "large" files in a new store location and hardlink them in sandboxes#18153
stuhood merged 38 commits into pantsbuild:main
Conversation
WIP due to lots of duplicated code, and because this errors without a clean LMDB store or …
let src = &self.large_file_path(digest);
let mut reader = std::fs::File::open(src)
    .map_err(|e| format!("Failed to read {src:?} for digest {digest:?}: {e}"))?;
let mut writer: Vec<u8> = vec![];
io::copy(&mut reader, &mut writer)
    .map_err(|e| format!("Failed to load large file into memory: {e}"))?;
Ok(Some(f(&writer[..])))
This will always need to exist (as opposed to forcing the caller to symlink) chiefly for the case that we're materializing a file that shouldn't be immutable. We'll need to copy no matter what.
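The hardlink-or-copy decision the comment describes can be sketched as follows. This is a minimal illustrative version, not the actual pantsbuild implementation: `materialize_file` and the `mutable` flag are assumed names, and the real code layers in permissions and async IO.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Materialize `src` (an immutable file in the store) at `dest`.
/// Hypothetical sketch: hardlink when the file can stay immutable,
/// and fall back to a full copy when it must be mutable or the link
/// fails (e.g. because the two paths are on different filesystems).
fn materialize_file(src: &Path, dest: &Path, mutable: bool) -> io::Result<u64> {
    if !mutable {
        if fs::hard_link(src, dest).is_ok() {
            return Ok(fs::metadata(dest)?.len());
        }
        // Hardlink failed (likely cross-device); fall through to copying.
    }
    fs::copy(src, dest)
}
```

Copying is unavoidable in the mutable case, since a hardlinked file would let the sandbox mutate the store's copy.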
Oh hmm

OH I wonder if we're caching the …
How do I repro the failure without a clean LMDB store?
This fixes #17065 by having remote cache loads be able to be streamed to disk. In essence, the remote store now has a `load_file` method in addition to `load_bytes`, and thus the caller can decide to download to a file instead. This doesn't make progress towards #18048 (this PR doesn't touch the local store at all), but I think it will help with integrating the remote store with that code: in theory the `File` could be provided in a way that can be part of the "large file pool" directly (and indeed, the decision about whether to download to a file or into memory ties into that). This also does a theoretically unnecessary extra pass over the data (as discussed in #18231) to verify the digest, but I think it'd make sense to do that as a future optimisation, since it'll require refactoring more deeply (down into `sharded_lmdb` and `hashing`, I think) and is best to build on #18153 once that lands.
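The provider-API shape described above (a `load_file` alongside `load_bytes`, so the caller can choose to stream a download to disk) might look roughly like this. The trait name, signatures, and digest type are illustrative assumptions, not the actual pantsbuild API.

```rust
use std::io::{self, Write};

// Hypothetical sketch of a remote byte-store provider: alongside
// `load_bytes`, it exposes a `load_file` that streams the blob
// straight to a writer, so large downloads need not be buffered
// entirely in memory.
trait ByteStoreProvider {
    /// Load the blob for `digest` fully into memory.
    fn load_bytes(&self, digest: &str) -> io::Result<Vec<u8>>;

    /// Stream the blob for `digest` into `writer`. The default
    /// implementation falls back to buffering via `load_bytes`;
    /// a real provider would stream chunk by chunk.
    fn load_file(&self, digest: &str, writer: &mut dyn Write) -> io::Result<()> {
        let bytes = self.load_bytes(digest)?;
        writer.write_all(&bytes)
    }
}
```

With this split, the caller can hand `load_file` a handle that is already part of the "large file pool", avoiding the in-memory round trip entirely.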
🎉
…9711) This (hopefully) optimises storing large blobs to a remote cache, by streaming them directly from the file stored on disk in the "FSDB". This builds on the FSDB local store work (#18153), relying on large objects being stored as an immutable file on disk, in the cache managed by Pants. This is an optimisation in several ways:

- Cutting out an extra temporary file:
  - Previously `Store::store_large_blob_remote` would load the whole blob from the local store and then write that to a temporary file. This was appropriate with LMDB-backed blobs.
  - With the new FSDB, there's already a file that can be used, with no need for that temporary, and so the file creation and writing overhead can be eliminated.
- Reducing sync IO in async tasks, due to mmap:
  - Previously `ByteStore::store_buffered` would take that temporary file and mmap it, to be able to slice into `Bytes` more efficiently... except this is secretly blocking/sync IO, happening within async tasks (AIUI: when accessing a mmap'd byte that's only on disk, not yet in memory, the whole OS thread is blocked/descheduled while the OS pulls the relevant part of the file into memory, i.e. `tokio` can't run another task on that thread).
  - This new approach uses normal `tokio` async IO mechanisms to read the file, and thus hopefully has higher concurrency.
- (This also eliminates the unmaintained `memmap` dependency.)

I haven't benchmarked this though. My main motivation for this is firming up the provider API before adding new byte store providers, for #11149. This also resolves some TODOs and even eliminates some `unsafe`, yay! The commits are individually reviewable. Fixes #19049, fixes #14341 (`memmap` removed), closes #17234 (solves the same problem but with an approach that wasn't possible at the time).
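The chunked-streaming idea above can be sketched synchronously: read the on-disk blob through a fixed-size buffer and hand each chunk to the uploader, instead of mmap'ing the whole file and slicing it. The real change uses tokio's async file APIs; this std-only version just shows the shape, and `upload_chunk` is an assumed stand-in for the per-chunk store call.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Stream the file at `path` to `upload_chunk` in `chunk_size` pieces,
/// returning the total number of bytes sent. Only one `chunk_size`
/// buffer is ever resident, regardless of how large the blob is.
fn store_from_file(
    path: &Path,
    chunk_size: usize,
    mut upload_chunk: impl FnMut(&[u8]) -> io::Result<()>,
) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; chunk_size];
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        upload_chunk(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}
```

Because the reads are explicit, an async variant can simply `await` them, so the runtime can schedule other tasks while the OS fetches the next chunk, unlike a page fault on an mmap'd slice.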
…ame device (#19894) As described in #18757, the "check that source and dest are on same device" strategy that was introduced in #18153 to decide whether we could hardlink when materializing files was not robust when faced with the same device being mounted in multiple locations. This change moves to a "create a canary" strategy for deciding when hardlinking between two destinations is legal. Hardlinks are canaried and memoized on a per-destination-root basis, so this strategy might actually be slightly cheaper than the previous one. Fixes #18757.
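The canary strategy can be sketched as below: rather than comparing device ids, actually attempt one hardlink from the store into the destination root and remember the outcome. This is illustrative only; the function name is assumed, and the real change memoizes the result per destination root (e.g. in a map guarded by a mutex), which is omitted here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Probe whether hardlinking from the store into `dest_root` works by
/// creating (and immediately removing) a canary link. This is robust
/// against the same device being mounted at multiple locations, which
/// defeats device-id comparison.
fn hardlinkable(store_file: &Path, dest_root: &Path) -> io::Result<bool> {
    let canary = dest_root.join(".hardlink_canary");
    let _ = fs::remove_file(&canary); // ignore "not found"
    let ok = fs::hard_link(store_file, &canary).is_ok();
    let _ = fs::remove_file(&canary);
    Ok(ok)
}
```

Since the result is memoized per root, the probe runs at most once per destination, which is why it can be cheaper than a per-file device check.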
…ame device (Cherry-pick of #19894) (#19910) Co-authored-by: Stu Hood <stuhood@gmail.com>
…ame device (Cherry-pick of #19894) (#19914)
…ts (#20055) As described in #19765, `2.17.x` uses more file handles than previous versions. Based on the location of the reported error, I suspect that this is due to the move from using the LMDB store for all files, to using the filesystem-based store for large files (#18153). In particular: rather than digesting files inside of `spawn_blocking` while capturing them into the LMDB store (imposing the [limit of blocking threads](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads) from the tokio runtime), `fn store` moved to digesting them using tokio's async file APIs, which impose no such limit. This change adds a semaphore to (some) file opens to provide a best-effort limit on files opened for the purposes of being captured. It additionally (in the first commit) fixes an extraneous file handle that was being kept open during capture. Fixes #19765.
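The best-effort cap on concurrently open files can be sketched with a bounded channel acting as a counting semaphore: a file is only opened while a permit is held. The real change uses an async semaphore around the capture path; the type and method names here are illustrative assumptions.

```rust
use std::fs::File;
use std::io;
use std::path::Path;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Limits how many files are open at once for capture. Acquiring a
/// permit pushes a token into a bounded channel (blocking when the
/// channel is full); releasing pops one back out.
struct FileOpenLimiter {
    permits: SyncSender<()>,
    returns: Receiver<()>,
}

impl FileOpenLimiter {
    fn new(max_open: usize) -> Self {
        let (permits, returns) = sync_channel(max_open);
        FileOpenLimiter { permits, returns }
    }

    /// Open `path` and run `f` on it while holding a permit, so at
    /// most `max_open` opens initiated through this limiter are in
    /// flight at a time.
    fn with_open<R>(
        &self,
        path: &Path,
        f: impl FnOnce(&File) -> io::Result<R>,
    ) -> io::Result<R> {
        self.permits.send(()).expect("limiter closed"); // acquire
        let result = File::open(path).and_then(|file| f(&file));
        self.returns.recv().expect("limiter closed"); // release
        result
    }
}
```

This caps handles only on the capture path, matching the "best-effort" framing: other file opens in the process are unaffected.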
…ts (Cherry-pick of #20055) (#20078) Co-authored-by: Stu Hood <stuhood@gmail.com>
…ts (Cherry-pick of #20055) (#20077) Co-authored-by: Stu Hood <stuhood@gmail.com>
Fixes #18048 by having `store` code now store/load from a different root if the file size is above some threshold. Also fixes #18231 by verifying content while downloading. `materialize_file` now hardlinks (falling back to copying when the store and workdirs are on separate filesystems) directly from this RO on-disk file if the file doesn't need to be mutable. This means that `--no-pantsd` and daemon restarts get the benefits of linking.
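The size-threshold routing described above can be sketched as a simple dispatch: blobs at or above a cutoff go to the filesystem-backed store ("FSDB"), smaller ones stay in LMDB. The threshold value and names here are illustrative assumptions, not the actual pantsbuild constants.

```rust
/// Assumed cutoff for illustration; the real threshold is configurable.
const LARGE_FILE_THRESHOLD: usize = 512 * 1024;

#[derive(Debug, PartialEq)]
enum StoreRoot {
    /// Small blobs: stored inside the LMDB database.
    Lmdb,
    /// Large blobs: stored as individual immutable files on disk,
    /// which makes hardlinking into sandboxes possible.
    LargeFiles,
}

fn store_root_for(len: usize) -> StoreRoot {
    if len >= LARGE_FILE_THRESHOLD {
        StoreRoot::LargeFiles
    } else {
        StoreRoot::Lmdb
    }
}
```

Only the `LargeFiles` root has a real file per digest, which is what makes the hardlink-on-materialize path above possible.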