Make remote cache writes be async by Eric-Arellano · Pull Request #11479 · pantsbuild/pants

Eric-Arellano · 2021-01-21T00:21:24Z

Closes #11434. This should avoid writing to the cache from noticeably slowing down Pants's performance.

This does add a new risk that some cache writes will not happen, particularly when Pantsd is not in use. To assess the problem, we add new metrics for when a cache write starts and finishes. If it becomes a serious problem, we can revisit this, including possibly requiring that Pantsd be used for remote cache writes.

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

Eric-Arellano · 2021-01-21T00:37:24Z

-  let (local_runner, local_runner_call_counter) = create_local_runner(1, 20);
+  let (local_runner, local_runner_call_counter) = create_local_runner(1, 100);


I bumped these to reduce the risk of flaky tests. I'd rather have slightly slower tests than flaky tests.

Eric-Arellano · 2021-01-21T00:38:14Z

+      let command_runner = self.clone();
+      let result = result.clone();


Thanks @stuhood for helping me to figure out why we needed to do this to avoid an issue with non-static lifetimes in the future!

Lmk if I should comment these two lines.

Eric-Arellano · 2021-01-21T01:25:21Z

Hm, this causes remote_cache_integration_test.py to fail because the below log line no longer shows up. I haven't debugged yet why.

"Failed to write to remote cache:.*StubCAS is configured to always fail",

tdyas · 2021-01-21T02:34:03Z

Hm, this causes remote_cache_integration_test.py to fail because the below log line no longer shows up. I haven't debugged yet why.

"Failed to write to remote cache:.*StubCAS is configured to always fail",

Probably because spawning the future via a direct call totokio::spawn no longer captures Rust logging from the future. See

pants/src/rust/engine/task_executor/src/lib.rs

Lines 163 to 174 in 9277842

    
             fn future_with_correct_context<F: Future>(future: F) -> impl Future<Output = F::Output> { 
        
               let logging_destination = logging::get_destination(); 
        
               let workunit_state = workunit_store::get_workunit_state(); 
        
               // NB: It is important that the first portion of this method is synchronous (meaning that this 
        
               // method cannot be `async`), because that means that it will run on the thread that calls it. 
        
               // The second, async portion of the method will run in the spawned Task. 
        
               logging::scope_task_destination(logging_destination, async move { 
        
                 workunit_store::scope_task_workunit_state(workunit_state, future).await 
        
               }) 
        
             }

and its use in

pants/src/rust/engine/task_executor/src/lib.rs

Lines 99 to 107 in 9277842

    
             pub fn spawn<O: Send + 'static, F: Future<Output = O> + Send + 'static>( 
        
               &self, 
        
               future: F, 
        
             ) -> impl Future<Output = O> { 
        
               self 
        
                 .handle 
        
                 .spawn(Self::future_with_correct_context(future)) 
        
                 .map(|r| r.expect("Background task exited unsafely.")) 
        
             }

to set the logging library's "destination".

jsirois · 2021-01-21T06:37:30Z

I'm going to bow out of this one - I have no specific knowledge of these bits like Tom does.

stuhood · 2021-01-21T18:24:04Z

Hm, this causes remote_cache_integration_test.py to fail because the below log line no longer shows up. I haven't debugged yet why.

"Failed to write to remote cache:.*StubCAS is configured to always fail",

Even then, I expect that that test will be flaky, because the session is (and should be) allowed to exit before those writes complete in a way that could be logged to the foreground.

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

Eric-Arellano · 2021-01-21T19:58:09Z

So I tried to use task_executor::Executor::spawn, but it results in this test failure:

thread 'remote_cache_tests::cache_write_does_not_block' has overflowed its stack
fatal runtime error: stack overflow
error: test failed, to rerun pass '-p process_execution --lib'

Caused by:
  process didn't exit successfully: `/Users/eric/code/pants/src/rust/engine/target/debug/deps/process_execution-e8c510dc15d50329` (signal: 6, SIGABRT: process abort signal)

I'm trying to understand the error. The SIGABRT is saying that cargo test did not exit cleanly, right? The signal is not coming from Pants itself, but rather, the test having a stack overflow and thus sending SIGABRT to Cargo?

We will need Pantsd to be used for remote cache writes to work properly, post #11479. Generally, pantsd has seen major improvements that this is a good change to make either way. [ci skip-rust] [ci skip-build-wheels]

Thanks Tom for the help! # Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

…-writes

Eric-Arellano · 2021-01-22T01:30:26Z

Even then, I expect that that test will be flaky, because the session is (and should be) allowed to exit before those writes complete in a way that could be logged to the foreground.

I ran the test 65 times on my M1 and it didn't flake. Note that pants_run() uses Pantsd, so maybe that helps? If this passes in CI, I'm thinking we can land but eagerly revert the moment we see a flake.

stuhood · 2021-01-22T17:27:41Z

Even then, I expect that that test will be flaky, because the session is (and should be) allowed to exit before those writes complete in a way that could be logged to the foreground.

I ran the test 65 times on my M1 and it didn't flake. Note that pants_run() uses Pantsd, so maybe that helps? If this passes in CI, I'm thinking we can land but eagerly revert the moment we see a flake.

pantsd ensures that the write happens in the background, but the race is that the log message might happen "after" the client has gone away, and so there would not be a console left to log to. It's possible that the mock is fast enough that this isn't a problem in practice.

tdyas · 2021-01-22T18:02:04Z

+            log::warn!("Failed to write to remote cache: {}", err);
+            context
+              .workunit_store
+              .increment_counter(Metric::RemoteCacheWriteErrors, 1);


Should not block this PR, but we should make sure cache errors still show up in stats when increments are "backgrounded" like this.

tdyas · 2021-01-22T18:03:29Z

+    Self::new_with_delays(0, 0)
+  }
+
+  pub fn new_with_delays(read_delay_ms: u64, write_delay_ms: u64) -> Result<Self, String> {


Idiomatic Rust would generally just calls this with_delays.

Closes #11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else. This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020. Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la #11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than #11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.

Closes pantsbuild#11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in pantsbuild#9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else. This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020. Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la pantsbuild#11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than pantsbuild#11908, i.e. that slowness in remote cache writes can slow down and even hang Pants. # Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

These metrics were added in #11479 to track what % of writes are actually completing due to the code block being async. But the metrics are useful beyond answering that particular question, and the current names are too specific. This also changes the "finished" metric to only record successes, whereas before it included errors too. Previously, to get successes, you had to do `remote_cache_write_finished - remote_cache_write_errors`, but now we directly store this. To get back the number "finished", you will now do `remote_cache_write_successes + remote_cache_write_errors`. [ci skip-build-wheels]

Eric-Arellano added 2 commits January 20, 2021 16:28

Make remote cache writes be async

c29c61a

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

Add a test + fix the other cache_write tests

c77ef59

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

Eric-Arellano requested review from jsirois, stuhood and tdyas January 21, 2021 00:21

Eric-Arellano commented Jan 21, 2021

View reviewed changes

tdyas reviewed Jan 21, 2021

View reviewed changes

Comment thread src/rust/engine/process_execution/src/remote_cache.rs Outdated

Add counters for cache write start and finish

0c99342

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

Eric-Arellano mentioned this pull request Jan 21, 2021

Internal change: enable pantsd in CI #11487

Merged

Use task_executor::Executor::spawn instead of tokio::spawn

1f91152

# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

Eric-Arellano removed the request for review from jsirois January 21, 2021 19:55

Eric-Arellano added 2 commits January 21, 2021 18:18

Fix test failing due to stack overflow

2d0a140

Thanks Tom for the help! # Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

Merge branch 'master' of github.com:pantsbuild/pants into cache-async…

589f535

…-writes

Eric-Arellano requested a review from tdyas January 22, 2021 01:31

stuhood approved these changes Jan 22, 2021

View reviewed changes

tdyas approved these changes Jan 22, 2021

View reviewed changes

Eric-Arellano merged commit 3059645 into pantsbuild:master Jan 22, 2021

Eric-Arellano deleted the cache-async-writes branch January 22, 2021 18:05

Eric-Arellano mentioned this pull request Feb 10, 2021

Counters can have dangling pointers to workunits #11548

Closed

Eric-Arellano mentioned this pull request Apr 20, 2021

Fix remote cache writes to not block the Pants run #11947

Merged

Eric-Arellano mentioned this pull request Jun 14, 2021

Rename remote cache write start and finished metrics #12202

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make remote cache writes be async#11479

Make remote cache writes be async#11479
Eric-Arellano merged 6 commits into
pantsbuild:masterfrom
Eric-Arellano:cache-async-writes

Eric-Arellano commented Jan 21, 2021 •

edited

Loading

Uh oh!

Eric-Arellano Jan 21, 2021

Uh oh!

Eric-Arellano Jan 21, 2021

Uh oh!

Eric-Arellano commented Jan 21, 2021

Uh oh!

tdyas commented Jan 21, 2021

Uh oh!

Uh oh!

jsirois commented Jan 21, 2021

Uh oh!

stuhood commented Jan 21, 2021 •

edited

Loading

Uh oh!

Eric-Arellano commented Jan 21, 2021

Uh oh!

Eric-Arellano commented Jan 22, 2021 •

edited

Loading

Uh oh!

stuhood commented Jan 22, 2021 •

edited

Loading

Uh oh!

tdyas Jan 22, 2021

Uh oh!

tdyas Jan 22, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		let (local_runner, local_runner_call_counter) = create_local_runner(1, 20);
		let (local_runner, local_runner_call_counter) = create_local_runner(1, 100);

		let command_runner = self.clone();
		let result = result.clone();

Uh oh!

Conversation

Eric-Arellano commented Jan 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Eric-Arellano Jan 21, 2021

Choose a reason for hiding this comment

Uh oh!

Eric-Arellano Jan 21, 2021

Choose a reason for hiding this comment

Uh oh!

Eric-Arellano commented Jan 21, 2021

Uh oh!

tdyas commented Jan 21, 2021

Uh oh!

Uh oh!

jsirois commented Jan 21, 2021

Uh oh!

stuhood commented Jan 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Eric-Arellano commented Jan 21, 2021

Uh oh!

Eric-Arellano commented Jan 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stuhood commented Jan 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tdyas Jan 22, 2021

Choose a reason for hiding this comment

Uh oh!

tdyas Jan 22, 2021

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Eric-Arellano commented Jan 21, 2021 •

edited

Loading

stuhood commented Jan 21, 2021 •

edited

Loading

Eric-Arellano commented Jan 22, 2021 •

edited

Loading

stuhood commented Jan 22, 2021 •

edited

Loading