Fix remote cache writes to not block the Pants run by Eric-Arellano · Pull Request #11947 · pantsbuild/pants

Eric-Arellano · 2021-04-20T21:06:03Z

Closes #11908. As described there, ensure_remote_has_recursive() was blocking due to its call to executor.block_on(). This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using Executor.spawn_blocking(). We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la #11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it.

While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than #11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.

Eric-Arellano · 2021-04-20T21:08:47Z

There's a rough edge still: if the server is being slow to respond, the dynamic UI will show "(Waiting) Find Python binary", for example. It is true that it's waiting, but really all that is waiting is the cache write; the process has already finished and the rest of Pants has proceeded. This is confusing UX. I don't know how we could fix this, though..lmk if I should open an issue.

Working on this PR also discovered a new bug: the remote cache write metrics aren't reported if the write did not finish. I'll open an issue for that.

illicitonion

I think the copy makes sense (sad though it may be) - it's the only way to make sure the bytes are still alive across async boundaries in the callback, and I guess the block_on was failing if we hit the thread limit of blocking threads at the wrong point, which I can see causing a deadlock...

illicitonion · 2021-04-20T21:53:36Z

  /// blocking, this accepts a function that views a slice rather than returning a clone of the
  /// data. The upshot is that the database is able to provide slices directly into shared memory.
  ///
-  /// The provided function is guaranteed to be called in a context where it is safe to block.


Have you audited all of the other transitive callers to make sure they don't block? IIRC a bunch of things like materialize_file end up calling through here too?

Yep, we do use this for a couple places like materialize_file.

Maybe I was wrong to remove this comment? It is safe to block, only it is not performant. But I spent a few minutes checking all the transitive call sites and I do think that we want to block on all those things because they are synchronous operations like materializing a digest.

Yeah, I'd restore the comment if it's still valid :) Thanks for checking!

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

stuhood · 2021-04-20T23:54:30Z

Closes #11908. As described there, ensure_remote_has_recursive() was blocking due to its call to executor.block_on(). This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

Blocking on a spawned thread is ok (tokio will let you: in other contexts it will panic instead), so I believe that that reasoning still stands.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using Executor.spawn_blocking(). We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la #11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it.

This doesn't follow I don't think... it is completely compatible to satisfy a future/async task from a blocking task: that's ok, and how threadpools are implemented in general. And references can be used whenever data stays on the stack for the entire length of the async operation: in this case, the referenced data stays on the stack of the blocked thread.

It is safe to block, only it is not performant.

This could be because we're setting https://docs.rs/tokio/1.5.0/tokio/runtime/struct.Builder.html#method.max_blocking_threads inappropriately here:

pants/src/rust/engine/task_executor/src/lib.rs

Lines 83 to 88 in af5d5a2

    
           let runtime = Builder::new_multi_thread() 
        
             .worker_threads(num_worker_threads) 
        
             .max_blocking_threads(max_threads - num_worker_threads) 
        
             .enable_all() 
        
             .build() 
        
             .map_err(|e| format!("Failed to start the runtime: {}", e))?;

... I would not expect that to cause a deadlock: just be slow. You could see whether adjusting that setting affects your repro.

I've also opened tokio-rs/tokio#3717 to try and get some clarity there.

stuhood

I'm not sure about the conclusion (see above), but I'll shipit this as a temporary measure. Thanks for investigating.

Eric-Arellano · 2021-04-21T00:16:24Z

Going to merge and cherry-pick into 2.4.x to stabilize remote caching. If we hear back from Tokio that it's possible to get the original semantics, we can forward fix this PR in 2.5.x or later releases.

Thanks for the reviews!

Closes pantsbuild#11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in pantsbuild#9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else. This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020. Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la pantsbuild#11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than pantsbuild#11908, i.e. that slowness in remote cache writes can slow down and even hang Pants. # Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

…1947) (#11950)

jsirois · 2021-05-13T00:15:34Z

-              let executor = local.executor().clone();
-              let maybe_upload = local
+              // We need to copy the bytes into memory so that they may be used safely in an async
+              // future. While this unfortunately increases memory consumption, we prioritize


... but that's also just being literal with existing APIs / trying to make a minimal change IIUC. The remote store could also have a local store (or get handed one) and be given an entry_type and digest and told to store bytes. It could then handle the shuffle instead of buffer all in its own task.

stuhood · 2021-06-07T22:00:28Z

So, finally got around to reading the response to tokio-rs/tokio#3717 ... what I expect happened is that we were obeying this principle:

One rule you can follow which guarantees that you have no deadlocks due to the spawn_blocking pool capacity is that any task running on it (using block_on or not) should eventually exit on its own even if all other threads were paused for the full duration of the spawn_blocking call.

...until we switched to using tower for gRPC. Because before that, the native gRPC implementation we were using was spawning its own event threads in order to interact with the C/C++ gRPC code (AFAIK), and so it was not limited by our tokio configuration:

pants/src/rust/engine/task_executor/src/lib.rs

Lines 83 to 88 in af5d5a2

    
           let runtime = Builder::new_multi_thread() 
        
             .worker_threads(num_worker_threads) 
        
             .max_blocking_threads(max_threads - num_worker_threads) 
        
             .enable_all() 
        
             .build() 
        
             .map_err(|e| format!("Failed to start the runtime: {}", e))?;

So, as mentioned on the discussion: one approach would be to use a dedicated unbounded pool for interacting with LMDB. Another might be to separate the start of an LMDB read transaction (which would involve syscalls) from the portion of the transaction that actually interacts with the reference to the data (which would only trigger page faults to load data from MMAP). To avoid async-closure lifetime issues, we'd probably need to do that by removing the closure to load_bytes_with, and give it a more natural shape where it returned a struct with a lifetime tied to the already open read transaction.

Fix remote cache writes to not block the Pants run

6348208

Eric-Arellano requested review from illicitonion, stuhood and tdyas April 20, 2021 21:06

illicitonion reviewed Apr 20, 2021

View reviewed changes

Fix test and better format cache.rs

658ba32

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

stuhood approved these changes Apr 20, 2021

View reviewed changes

Eric-Arellano merged commit 90d46bf into pantsbuild:main Apr 21, 2021

Eric-Arellano deleted the fix-blocking branch April 21, 2021 00:16

Eric-Arellano added a commit that referenced this pull request Apr 21, 2021

Fix remote cache writes to not block the Pants run (Cherry-pick of #1…

4c81520

…1947) (#11950)

jsirois reviewed May 13, 2021

View reviewed changes

stuhood mentioned this pull request Jun 7, 2021

Fix remote cache writes memory exhaustion. #12087

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix remote cache writes to not block the Pants run#11947

Fix remote cache writes to not block the Pants run#11947
Eric-Arellano merged 2 commits into
pantsbuild:mainfrom
Eric-Arellano:fix-blocking

Eric-Arellano commented Apr 20, 2021 •

edited

Loading

Uh oh!

Eric-Arellano commented Apr 20, 2021 •

edited

Loading

Uh oh!

illicitonion left a comment

Uh oh!

illicitonion Apr 20, 2021

Uh oh!

Eric-Arellano Apr 20, 2021

Uh oh!

illicitonion Apr 21, 2021

Uh oh!

stuhood commented Apr 20, 2021 •

edited

Loading

Uh oh!

stuhood left a comment

Uh oh!

Eric-Arellano commented Apr 21, 2021

Uh oh!

jsirois May 13, 2021

Uh oh!

stuhood commented Jun 7, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

Eric-Arellano commented Apr 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Eric-Arellano commented Apr 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

illicitonion left a comment

Choose a reason for hiding this comment

Uh oh!

illicitonion Apr 20, 2021

Choose a reason for hiding this comment

Uh oh!

Eric-Arellano Apr 20, 2021

Choose a reason for hiding this comment

Uh oh!

illicitonion Apr 21, 2021

Choose a reason for hiding this comment

Uh oh!

stuhood commented Apr 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stuhood left a comment

Choose a reason for hiding this comment

Uh oh!

Eric-Arellano commented Apr 21, 2021

Uh oh!

jsirois May 13, 2021

Choose a reason for hiding this comment

Uh oh!

stuhood commented Jun 7, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Eric-Arellano commented Apr 20, 2021 •

edited

Loading

Eric-Arellano commented Apr 20, 2021 •

edited

Loading

stuhood commented Apr 20, 2021 •

edited

Loading