Fix remote cache writes to not block the Pants run#11947
Conversation
|
There's a rough edge still: if the server is being slow to respond, the dynamic UI will show "(Waiting) Find Python binary", for example. It is true that it's waiting, but really all that is waiting is the cache write; the process has already finished and the rest of Pants has proceeded. This is confusing UX. I don't know how we could fix this, though..lmk if I should open an issue. Working on this PR also discovered a new bug: the remote cache write metrics aren't reported if the write did not finish. I'll open an issue for that. |
illicitonion
left a comment
There was a problem hiding this comment.
I think the copy makes sense (sad though it may be) - it's the only way to make sure the bytes are still alive across async boundaries in the callback, and I guess the block_on was failing if we hit the thread limit of blocking threads at the wrong point, which I can see causing a deadlock...
| /// blocking, this accepts a function that views a slice rather than returning a clone of the | ||
| /// data. The upshot is that the database is able to provide slices directly into shared memory. | ||
| /// | ||
| /// The provided function is guaranteed to be called in a context where it is safe to block. |
There was a problem hiding this comment.
Have you audited all of the other transitive callers to make sure they don't block? IIRC a bunch of things like materialize_file end up calling through here too?
There was a problem hiding this comment.
Yep, we do use this for a couple places like materialize_file.
Maybe I was wrong to remove this comment? It is safe to block, only it is not performant. But I spent a few minutes checking all the transitive call sites and I do think that we want to block on all those things because they are synchronous operations like materializing a digest.
There was a problem hiding this comment.
Yeah, I'd restore the comment if it's still valid :) Thanks for checking!
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
Blocking on a spawned thread is ok (tokio will let you: in other contexts it will panic instead), so I believe that that reasoning still stands.
This doesn't follow I don't think... it is completely compatible to satisfy a future/async task from a blocking task: that's ok, and how threadpools are implemented in general. And references can be used whenever data stays on the stack for the entire length of the async operation: in this case, the referenced data stays on the stack of the blocked thread.
This could be because we're setting https://docs.rs/tokio/1.5.0/tokio/runtime/struct.Builder.html#method.max_blocking_threads inappropriately here: pants/src/rust/engine/task_executor/src/lib.rs Lines 83 to 88 in af5d5a2 I've also opened tokio-rs/tokio#3717 to try and get some clarity there. |
stuhood
left a comment
There was a problem hiding this comment.
I'm not sure about the conclusion (see above), but I'll shipit this as a temporary measure. Thanks for investigating.
|
Going to merge and cherry-pick into 2.4.x to stabilize remote caching. If we hear back from Tokio that it's possible to get the original semantics, we can forward fix this PR in 2.5.x or later releases. Thanks for the reviews! |
Closes pantsbuild#11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in pantsbuild#9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else. This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020. Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la pantsbuild#11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than pantsbuild#11908, i.e. that slowness in remote cache writes can slow down and even hang Pants. # Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
| let executor = local.executor().clone(); | ||
| let maybe_upload = local | ||
| // We need to copy the bytes into memory so that they may be used safely in an async | ||
| // future. While this unfortunately increases memory consumption, we prioritize |
There was a problem hiding this comment.
... but that's also just being literal with existing APIs / trying to make a minimal change IIUC. The remote store could also have a local store (or get handed one) and be given an entry_type and digest and told to store bytes. It could then handle the shuffle instead of buffer all in its own task.
|
So, finally got around to reading the response to tokio-rs/tokio#3717 ... what I expect happened is that we were obeying this principle:
...until we switched to using pants/src/rust/engine/task_executor/src/lib.rs Lines 83 to 88 in af5d5a2 So, as mentioned on the discussion: one approach would be to use a dedicated unbounded pool for interacting with LMDB. Another might be to separate the start of an LMDB read transaction (which would involve syscalls) from the portion of the transaction that actually interacts with the reference to the data (which would only trigger page faults to load data from MMAP). To avoid async-closure lifetime issues, we'd probably need to do that by removing the closure to |
Closes #11908. As described there,
ensure_remote_has_recursive()was blocking due to its call toexecutor.block_on(). This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.
Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using
Executor.spawn_blocking(). We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la #11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it.While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than #11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.