Materialize "large" files in a new store location and hardlink them in sandboxes#18153
stuhood merged 38 commits into pantsbuild:main
Conversation
WIP due to lots of duplicated code, and because this errors without a clean LMDB store or …
let src = &self.large_file_path(digest);
let mut reader = std::fs::File::open(src)
    .map_err(|e| format!("Failed to read {src:?} for digest {digest:?}: {e}"))?;
let mut writer: Vec<u8> = vec![];
io::copy(&mut reader, &mut writer)
    .map_err(|e| format!("Failed to load large file into memory: {e}"))?;
Ok(Some(f(&writer[..])))
This will always need to exist (as opposed to forcing the caller to symlink) chiefly for the case that we're materializing a file that shouldn't be immutable. We'll need to copy no matter what.
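The hardlink-or-copy decision the comment describes can be sketched as follows. This is a minimal illustrative version, not the actual pantsbuild implementation: `materialize_file` and the `mutable` flag are assumed names, and the real code layers in permissions and async IO.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Materialize `src` (an immutable file in the store) at `dest`.
/// Hypothetical sketch: hardlink when the file can stay immutable,
/// and fall back to a full copy when it must be mutable or the link
/// fails (e.g. because the two paths are on different filesystems).
fn materialize_file(src: &Path, dest: &Path, mutable: bool) -> io::Result<u64> {
    if !mutable {
        if fs::hard_link(src, dest).is_ok() {
            return Ok(fs::metadata(dest)?.len());
        }
        // Hardlink failed (likely cross-device); fall through to copying.
    }
    fs::copy(src, dest)
}
```

Copying is unavoidable in the mutable case, since a hardlinked file would let the sandbox mutate the store's copy.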
Oh hmm

OH I wonder if we're caching the …
How do I repro the failure without a clean LMDB store?
This fixes #17065 by having remote cache loads be able to be streamed to disk. In essence, the remote store now has a `load_file` method in addition to `load_bytes`, and thus the caller can decide to download to a file instead. This doesn't make progress towards #18048 (this PR doesn't touch the local store at all), but I think it will help with integrating the remote store with that code: in theory the `File` could be provided in a way that can be part of the "large file pool" directly (and indeed, the decision about whether to download to a file or into memory ties into that). This also does a theoretically unnecessary extra pass over the data (as discussed in #18231) to verify the digest, but I think it'd make sense to do that as a future optimisation, since it'll require refactoring more deeply (down into `sharded_lmdb` and `hashing`, I think) and is best to build on #18153 once that lands.
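The provider-API shape described above (a `load_file` alongside `load_bytes`, so the caller can choose to stream a download to disk) might look roughly like this. The trait name, signatures, and digest type are illustrative assumptions, not the actual pantsbuild API.

```rust
use std::io::{self, Write};

// Hypothetical sketch of a remote byte-store provider: alongside
// `load_bytes`, it exposes a `load_file` that streams the blob
// straight to a writer, so large downloads need not be buffered
// entirely in memory.
trait ByteStoreProvider {
    /// Load the blob for `digest` fully into memory.
    fn load_bytes(&self, digest: &str) -> io::Result<Vec<u8>>;

    /// Stream the blob for `digest` into `writer`. The default
    /// implementation falls back to buffering via `load_bytes`;
    /// a real provider would stream chunk by chunk.
    fn load_file(&self, digest: &str, writer: &mut dyn Write) -> io::Result<()> {
        let bytes = self.load_bytes(digest)?;
        writer.write_all(&bytes)
    }
}
```

With this split, the caller can hand `load_file` a handle that is already part of the "large file pool", avoiding the in-memory round trip entirely.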
🎉
…9711) This (hopefully) optimises storing large blobs to a remote cache, by streaming them directly from the file stored on disk in the "FSDB". This builds on the FSDB local store work (#18153), relying on large objects being stored as an immutable file on disk, in the cache managed by Pants. This is an optimisation in several ways:

- Cutting out an extra temporary file:
  - Previously `Store::store_large_blob_remote` would load the whole blob from the local store and then write that to a temporary file. This was appropriate with LMDB-backed blobs.
  - With the new FSDB, there's already a file that can be used, with no need for that temporary, and so the file creation and writing overhead can be eliminated.
- Reducing sync IO in async tasks, due to mmap:
  - Previously `ByteStore::store_buffered` would take that temporary file and mmap it, to be able to slice into `Bytes` more efficiently... except this is secretly blocking/sync IO, happening within async tasks (AIUI: when accessing a mmap'd byte that's only on disk, not yet in memory, the whole OS thread is blocked/descheduled while the OS pulls the relevant part of the file into memory, i.e. `tokio` can't run another task on that thread).
  - This new approach uses normal `tokio` async IO mechanisms to read the file, and thus hopefully has higher concurrency.
- (This also eliminates the unmaintained `memmap` dependency.)

I haven't benchmarked this though. My main motivation for this is firming up the provider API before adding new byte store providers, for #11149. This also resolves some TODOs and even eliminates some `unsafe`, yay! The commits are individually reviewable. Fixes #19049, fixes #14341 (`memmap` removed), closes #17234 (solves the same problem but with an approach that wasn't possible at the time).
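The chunked-streaming idea above can be sketched synchronously: read the on-disk blob through a fixed-size buffer and hand each chunk to the uploader, instead of mmap'ing the whole file and slicing it. The real change uses tokio's async file APIs; this std-only version just shows the shape, and `upload_chunk` is an assumed stand-in for the per-chunk store call.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Stream the file at `path` to `upload_chunk` in `chunk_size` pieces,
/// returning the total number of bytes sent. Only one `chunk_size`
/// buffer is ever resident, regardless of how large the blob is.
fn store_from_file(
    path: &Path,
    chunk_size: usize,
    mut upload_chunk: impl FnMut(&[u8]) -> io::Result<()>,
) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; chunk_size];
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        upload_chunk(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}
```

Because the reads are explicit, an async variant can simply `await` them, so the runtime can schedule other tasks while the OS fetches the next chunk, unlike a page fault on an mmap'd slice.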
…ame device (#19894) As described in #18757, the "check that source and dest are on same device" strategy that was introduced in #18153 to decide whether we could hardlink when materializing files was not robust when faced with the same device being mounted in multiple locations. This change moves to a "create a canary" strategy for deciding when hardlinking between two destinations is legal. Hardlinks are canaried and memoized on a per-destination-root basis, so this strategy might actually be slightly cheaper than the previous one. Fixes #18757.
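The canary strategy can be sketched as below: rather than comparing device ids, actually attempt one hardlink from the store into the destination root and remember the outcome. This is illustrative only; the function name is assumed, and the real change memoizes the result per destination root (e.g. in a map guarded by a mutex), which is omitted here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Probe whether hardlinking from the store into `dest_root` works by
/// creating (and immediately removing) a canary link. This is robust
/// against the same device being mounted at multiple locations, which
/// defeats device-id comparison.
fn hardlinkable(store_file: &Path, dest_root: &Path) -> io::Result<bool> {
    let canary = dest_root.join(".hardlink_canary");
    let _ = fs::remove_file(&canary); // ignore "not found"
    let ok = fs::hard_link(store_file, &canary).is_ok();
    let _ = fs::remove_file(&canary);
    Ok(ok)
}
```

Since the result is memoized per root, the probe runs at most once per destination, which is why it can be cheaper than a per-file device check.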
…ame device (Cherry-pick of #19894) (#19910) Co-authored-by: Stu Hood <stuhood@gmail.com>
…ame device (Cherry-pick of #19894) (#19914)
…ts (#20055) As described in #19765, `2.17.x` uses more file handles than previous versions. Based on the location of the reported error, I suspect that this is due to the move from using the LMDB store for all files, to using the filesystem-based store for large files (#18153). In particular: rather than digesting files inside of `spawn_blocking` while capturing them into the LMDB store (imposing the [limit of blocking threads](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads) from the tokio runtime), `fn store` moved to digesting them using tokio's async file APIs, which impose no such limit. This change adds a semaphore to (some) file opens to provide a best-effort limit on files opened for the purposes of being captured. It additionally (in the first commit) fixes an extraneous file handle that was being kept open during capture. Fixes #19765.
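The best-effort cap on concurrently open files can be sketched with a bounded channel acting as a counting semaphore: a file is only opened while a permit is held. The real change uses an async semaphore around the capture path; the type and method names here are illustrative assumptions.

```rust
use std::fs::File;
use std::io;
use std::path::Path;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Limits how many files are open at once for capture. Acquiring a
/// permit pushes a token into a bounded channel (blocking when the
/// channel is full); releasing pops one back out.
struct FileOpenLimiter {
    permits: SyncSender<()>,
    returns: Receiver<()>,
}

impl FileOpenLimiter {
    fn new(max_open: usize) -> Self {
        let (permits, returns) = sync_channel(max_open);
        FileOpenLimiter { permits, returns }
    }

    /// Open `path` and run `f` on it while holding a permit, so at
    /// most `max_open` opens initiated through this limiter are in
    /// flight at a time.
    fn with_open<R>(
        &self,
        path: &Path,
        f: impl FnOnce(&File) -> io::Result<R>,
    ) -> io::Result<R> {
        self.permits.send(()).expect("limiter closed"); // acquire
        let result = File::open(path).and_then(|file| f(&file));
        self.returns.recv().expect("limiter closed"); // release
        result
    }
}
```

This caps handles only on the capture path, matching the "best-effort" framing: other file opens in the process are unaffected.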
…ts (Cherry-pick of #20055) (#20078) Co-authored-by: Stu Hood <stuhood@gmail.com>
…ts (Cherry-pick of #20055) (#20077) Co-authored-by: Stu Hood <stuhood@gmail.com>
Fixes #18048 by having `store` code now store/load from a different root if the file size is above some threshold. Also fixes #18231 by verifying content while downloading. `materialize_file` now hardlinks (falling back to copying when the store and workdirs are on separate filesystems) directly from this RO on-disk file if the file doesn't need to be mutable. This means that `--no-pantsd` and daemon restarts get the benefits of linking.
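The size-threshold routing described above can be sketched as a simple dispatch: blobs at or above a cutoff go to the filesystem-backed store ("FSDB"), smaller ones stay in LMDB. The threshold value and names here are illustrative assumptions, not the actual pantsbuild constants.

```rust
/// Assumed cutoff for illustration; the real threshold is configurable.
const LARGE_FILE_THRESHOLD: usize = 512 * 1024;

#[derive(Debug, PartialEq)]
enum StoreRoot {
    /// Small blobs: stored inside the LMDB database.
    Lmdb,
    /// Large blobs: stored as individual immutable files on disk,
    /// which makes hardlinking into sandboxes possible.
    LargeFiles,
}

fn store_root_for(len: usize) -> StoreRoot {
    if len >= LARGE_FILE_THRESHOLD {
        StoreRoot::LargeFiles
    } else {
        StoreRoot::Lmdb
    }
}
```

Only the `LargeFiles` root has a real file per digest, which is what makes the hardlink-on-materialize path above possible.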