Skip to content

Fix remote cache writes to not block the Pants run#11947

Merged
Eric-Arellano merged 2 commits into
pantsbuild:mainfrom
Eric-Arellano:fix-blocking
Apr 21, 2021
Merged

Fix remote cache writes to not block the Pants run#11947
Eric-Arellano merged 2 commits into
pantsbuild:mainfrom
Eric-Arellano:fix-blocking

Conversation

@Eric-Arellano
Copy link
Copy Markdown
Contributor

@Eric-Arellano Eric-Arellano commented Apr 20, 2021

Closes #11908. As described there, ensure_remote_has_recursive() was blocking due to its call to executor.block_on(). This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using Executor.spawn_blocking(). We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la #11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it.

While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than #11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.

@Eric-Arellano
Copy link
Copy Markdown
Contributor Author

Eric-Arellano commented Apr 20, 2021

There's a rough edge still: if the server is being slow to respond, the dynamic UI will show "(Waiting) Find Python binary", for example. It is true that it's waiting, but really all that is waiting is the cache write; the process has already finished and the rest of Pants has proceeded. This is confusing UX. I don't know how we could fix this, though..lmk if I should open an issue.

Working on this PR also discovered a new bug: the remote cache write metrics aren't reported if the write did not finish. I'll open an issue for that.

Copy link
Copy Markdown
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the copy makes sense (sad though it may be) - it's the only way to make sure the bytes are still alive across async boundaries in the callback, and I guess the block_on was failing if we hit the thread limit of blocking threads at the wrong point, which I can see causing a deadlock...

/// blocking, this accepts a function that views a slice rather than returning a clone of the
/// data. The upshot is that the database is able to provide slices directly into shared memory.
///
/// The provided function is guaranteed to be called in a context where it is safe to block.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you audited all of the other transitive callers to make sure they don't block? IIRC a bunch of things like materialize_file end up calling through here too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we do use this for a couple places like materialize_file.

Maybe I was wrong to remove this comment? It is safe to block, only it is not performant. But I spent a few minutes checking all the transitive call sites and I do think that we want to block on all those things because they are synchronous operations like materializing a digest.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd restore the comment if it's still valid :) Thanks for checking!

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@stuhood
Copy link
Copy Markdown
Member

stuhood commented Apr 20, 2021

Closes #11908. As described there, ensure_remote_has_recursive() was blocking due to its call to executor.block_on(). This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

Blocking on a spawned thread is ok (tokio will let you: in other contexts it will panic instead), so I believe that that reasoning still stands.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using Executor.spawn_blocking(). We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la #11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it.

This doesn't follow I don't think... it is completely compatible to satisfy a future/async task from a blocking task: that's ok, and how threadpools are implemented in general. And references can be used whenever data stays on the stack for the entire length of the async operation: in this case, the referenced data stays on the stack of the blocked thread.


It is safe to block, only it is not performant.

This could be because we're setting https://docs.rs/tokio/1.5.0/tokio/runtime/struct.Builder.html#method.max_blocking_threads inappropriately here:

let runtime = Builder::new_multi_thread()
.worker_threads(num_worker_threads)
.max_blocking_threads(max_threads - num_worker_threads)
.enable_all()
.build()
.map_err(|e| format!("Failed to start the runtime: {}", e))?;
... I would not expect that to cause a deadlock: just be slow. You could see whether adjusting that setting affects your repro.

I've also opened tokio-rs/tokio#3717 to try and get some clarity there.

Copy link
Copy Markdown
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the conclusion (see above), but I'll shipit this as a temporary measure. Thanks for investigating.

@Eric-Arellano
Copy link
Copy Markdown
Contributor Author

Going to merge and cherry-pick into 2.4.x to stabilize remote caching. If we hear back from Tokio that it's possible to get the original semantics, we can forward fix this PR in 2.5.x or later releases.

Thanks for the reviews!

@Eric-Arellano Eric-Arellano merged commit 90d46bf into pantsbuild:main Apr 21, 2021
@Eric-Arellano Eric-Arellano deleted the fix-blocking branch April 21, 2021 00:16
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this pull request Apr 21, 2021
Closes pantsbuild#11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in pantsbuild#9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la pantsbuild#11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. 

While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than pantsbuild#11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
let executor = local.executor().clone();
let maybe_upload = local
// We need to copy the bytes into memory so that they may be used safely in an async
// future. While this unfortunately increases memory consumption, we prioritize
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... but that's also just being literal with existing APIs / trying to make a minimal change IIUC. The remote store could also have a local store (or get handed one) and be given an entry_type and digest and told to store bytes. It could then handle the shuffle instead of buffer all in its own task.

@stuhood
Copy link
Copy Markdown
Member

stuhood commented Jun 7, 2021

So, finally got around to reading the response to tokio-rs/tokio#3717 ... what I expect happened is that we were obeying this principle:

One rule you can follow which guarantees that you have no deadlocks due to the spawn_blocking pool capacity is that any task running on it (using block_on or not) should eventually exit on its own even if all other threads were paused for the full duration of the spawn_blocking call.

...until we switched to using tower for gRPC. Because before that, the native gRPC implementation we were using was spawning its own event threads in order to interact with the C/C++ gRPC code (AFAIK), and so it was not limited by our tokio configuration:

let runtime = Builder::new_multi_thread()
.worker_threads(num_worker_threads)
.max_blocking_threads(max_threads - num_worker_threads)
.enable_all()
.build()
.map_err(|e| format!("Failed to start the runtime: {}", e))?;

So, as mentioned on the discussion: one approach would be to use a dedicated unbounded pool for interacting with LMDB. Another might be to separate the start of an LMDB read transaction (which would involve syscalls) from the portion of the transaction that actually interacts with the reference to the data (which would only trigger page faults to load data from MMAP). To avoid async-closure lifetime issues, we'd probably need to do that by removing the closure to load_bytes_with, and give it a more natural shape where it returned a struct with a lifetime tied to the already open read transaction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI sometimes hangs when writing to the remote cache

4 participants