Skip to content

Remove forced copying in local load_bytes_with#9793

Merged
stuhood merged 3 commits into
pantsbuild:masterfrom
stuhood:stuhood/remove-copy-in-load-bytes-with
May 16, 2020
Merged

Remove forced copying in local load_bytes_with#9793
stuhood merged 3 commits into
pantsbuild:masterfrom
stuhood:stuhood/remove-copy-in-load-bytes-with

Conversation

@stuhood
Copy link
Copy Markdown
Member

@stuhood stuhood commented May 16, 2020

Problem

The LMDB API allows for direct references into its storage, which will always be MMAP'ed. But because it has a blocking API, we interact with it on threads that have been spawned for that purpose.

To allow for interacting with the database in a way that avoids copying data into userspace, the load_bytes_with method takes a function that will be called on the blocking thread with a reference directly to that memory. This allows for minimizing copies of data, and avoiding holding full copies of database entries in memory.

But at some point a while back (before async/await made dealing with references easy again), load_bytes_with started passing a Bytes instance to its callback. And constructing a Bytes instance with anything other than a static memory reference (which this isn't) requires copying into the Bytes instance to give it ownership. This meant that we weren't actually taking advantage of the odd shape of load_bytes_with.

Solution

Switch the local load_bytes_with back to providing a reference into the database's owned memory, and document the reason for its shape. In separate commits, port the serverset crate and code that touches it to async/await. Finally, adjust the remote store_bytes to accept a reference to avoid potential copies there as well.

Result

Less memory usage, and less copying of data. In particular: cases that read from the local database in order to copy elsewhere (such as materialize_directory, ensure_remote_has_recursive, and BRFS read) will now copy data directly out of the database and into the destination.

ensure_remote_has_recursive should now hold onto only one chunk's worth of data at a time per file it is uploading, which might be sufficient to fix #9395 (although the other changes mentioned there will likely still be useful).

Stu Hood and others added 3 commits May 15, 2020 21:13
@stuhood stuhood requested review from gshuflin, illicitonion, jsirois and tdyas and removed request for gshuflin May 16, 2020 04:27
Copy link
Copy Markdown
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

Will post a few more thoughts on #9395, too :)

.to_boxed(),
let resource_name = resource_name.clone();
let store = store.clone();
let f = f.clone();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needed?

Copy link
Copy Markdown
Member Author

@stuhood stuhood May 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with_byte_stream_client is executing a loop calling a closure that returns a future. The future that is created each time the closure is called takes ownership of the value, so we need the clone.

I think that I could fiddle with the lifetime parameters to declare that the closure returns a future with a lifetime tied to the inputs...? But this is green, so gonna let it slide...

@stuhood stuhood merged commit f48b32a into pantsbuild:master May 16, 2020
@stuhood stuhood deleted the stuhood/remove-copy-in-load-bytes-with branch May 16, 2020 16:28
Eric-Arellano added a commit that referenced this pull request Apr 21, 2021
Closes #11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la #11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. 

While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than #11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this pull request Apr 21, 2021
Closes pantsbuild#11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in pantsbuild#9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la pantsbuild#11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. 

While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than pantsbuild#11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Silent crash or failure when remoting.

2 participants