Skip to content

Make remote cache writes be async#11479

Merged
Eric-Arellano merged 6 commits into
pantsbuild:masterfrom
Eric-Arellano:cache-async-writes
Jan 22, 2021
Merged

Make remote cache writes be async#11479
Eric-Arellano merged 6 commits into
pantsbuild:masterfrom
Eric-Arellano:cache-async-writes

Conversation

@Eric-Arellano
Copy link
Copy Markdown
Contributor

@Eric-Arellano Eric-Arellano commented Jan 21, 2021

Closes #11434. This should avoid writing to the cache from noticeably slowing down Pants's performance.

This does add a new risk that some cache writes will not happen, particularly when Pantsd is not in use. To assess the problem, we add new metrics for when a cache write starts and finishes. If it becomes a serious problem, we can revisit this, including possibly requiring that Pantsd be used for remote cache writes.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Comment on lines -183 to +184
let (local_runner, local_runner_call_counter) = create_local_runner(1, 20);
let (local_runner, local_runner_call_counter) = create_local_runner(1, 100);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bumped these to reduce the risk of flaky tests. I'd rather have slightly slower tests than flaky tests.

Comment on lines +515 to +516
let command_runner = self.clone();
let result = result.clone();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stuhood for helping me to figure out why we needed to do this to avoid an issue with non-static lifetimes in the future!

Lmk if I should comment these two lines.

@Eric-Arellano
Copy link
Copy Markdown
Contributor Author

Hm, this causes remote_cache_integration_test.py to fail because the below log line no longer shows up. I haven't debugged yet why.

"Failed to write to remote cache:.*StubCAS is configured to always fail",

@tdyas
Copy link
Copy Markdown
Contributor

tdyas commented Jan 21, 2021

Hm, this causes remote_cache_integration_test.py to fail because the below log line no longer shows up. I haven't debugged yet why.

"Failed to write to remote cache:.*StubCAS is configured to always fail",

Probably because spawning the future via a direct call totokio::spawn no longer captures Rust logging from the future. See

fn future_with_correct_context<F: Future>(future: F) -> impl Future<Output = F::Output> {
let logging_destination = logging::get_destination();
let workunit_state = workunit_store::get_workunit_state();
// NB: It is important that the first portion of this method is synchronous (meaning that this
// method cannot be `async`), because that means that it will run on the thread that calls it.
// The second, async portion of the method will run in the spawned Task.
logging::scope_task_destination(logging_destination, async move {
workunit_store::scope_task_workunit_state(workunit_state, future).await
})
}
and its use in
pub fn spawn<O: Send + 'static, F: Future<Output = O> + Send + 'static>(
&self,
future: F,
) -> impl Future<Output = O> {
self
.handle
.spawn(Self::future_with_correct_context(future))
.map(|r| r.expect("Background task exited unsafely."))
}
to set the logging library's "destination".

Comment thread src/rust/engine/process_execution/src/remote_cache.rs Outdated
@jsirois
Copy link
Copy Markdown
Contributor

jsirois commented Jan 21, 2021

I'm going to bow out of this one - I have no specific knowledge of these bits like Tom does.

@stuhood
Copy link
Copy Markdown
Member

stuhood commented Jan 21, 2021

Hm, this causes remote_cache_integration_test.py to fail because the below log line no longer shows up. I haven't debugged yet why.

"Failed to write to remote cache:.*StubCAS is configured to always fail",

Even then, I expect that that test will be flaky, because the session is (and should be) allowed to exit before those writes complete in a way that could be logged to the foreground.

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@Eric-Arellano Eric-Arellano removed the request for review from jsirois January 21, 2021 19:55
@Eric-Arellano
Copy link
Copy Markdown
Contributor Author

So I tried to use task_executor::Executor::spawn, but it results in this test failure:

thread 'remote_cache_tests::cache_write_does_not_block' has overflowed its stack
fatal runtime error: stack overflow
error: test failed, to rerun pass '-p process_execution --lib'

Caused by:
  process didn't exit successfully: `/Users/eric/code/pants/src/rust/engine/target/debug/deps/process_execution-e8c510dc15d50329` (signal: 6, SIGABRT: process abort signal)

I'm trying to understand the error. The SIGABRT is saying that cargo test did not exit cleanly, right? The signal is not coming from Pants itself, but rather, the test having a stack overflow and thus sending SIGABRT to Cargo?

Eric-Arellano added a commit that referenced this pull request Jan 21, 2021
We will need Pantsd to be used for remote cache writes to work properly, post #11479.

Generally, pantsd has seen major improvements that this is a good change to make either way.

[ci skip-rust]
[ci skip-build-wheels]
Thanks Tom for the help!

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@Eric-Arellano
Copy link
Copy Markdown
Contributor Author

Eric-Arellano commented Jan 22, 2021

Even then, I expect that that test will be flaky, because the session is (and should be) allowed to exit before those writes complete in a way that could be logged to the foreground.

I ran the test 65 times on my M1 and it didn't flake. Note that pants_run() uses Pantsd, so maybe that helps? If this passes in CI, I'm thinking we can land but eagerly revert the moment we see a flake.

@Eric-Arellano Eric-Arellano requested a review from tdyas January 22, 2021 01:31
@stuhood
Copy link
Copy Markdown
Member

stuhood commented Jan 22, 2021

Even then, I expect that that test will be flaky, because the session is (and should be) allowed to exit before those writes complete in a way that could be logged to the foreground.

I ran the test 65 times on my M1 and it didn't flake. Note that pants_run() uses Pantsd, so maybe that helps? If this passes in CI, I'm thinking we can land but eagerly revert the moment we see a flake.

pantsd ensures that the write happens in the background, but the race is that the log message might happen "after" the client has gone away, and so there would not be a console left to log to. It's possible that the mock is fast enough that this isn't a problem in practice.

log::warn!("Failed to write to remote cache: {}", err);
context
.workunit_store
.increment_counter(Metric::RemoteCacheWriteErrors, 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not block this PR, but we should make sure cache errors still show up in stats when increments are "backgrounded" like this.

Self::new_with_delays(0, 0)
}

pub fn new_with_delays(read_delay_ms: u64, write_delay_ms: u64) -> Result<Self, String> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idiomatic Rust would generally just calls this with_delays.

@Eric-Arellano Eric-Arellano merged commit 3059645 into pantsbuild:master Jan 22, 2021
@Eric-Arellano Eric-Arellano deleted the cache-async-writes branch January 22, 2021 18:05
Eric-Arellano added a commit that referenced this pull request Apr 21, 2021
Closes #11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la #11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. 

While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than #11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this pull request Apr 21, 2021
Closes pantsbuild#11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in pantsbuild#9793, which reasoned that only the spawned thread would get blocked, which would be safe - but it turns out that this blocking stops Pants from doing anything else.

This was not introduced due to the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.

Fundamentally, the issue makes sense. Reading from LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. via using `Executor.spawn_blocking()`. We tried to minimize memory consumption by allowing for a callback to access a reference/slice of the bytes, rather than cloning it. However, we desire for the remote code to be async, e.g. so that it can finish in the background a la pantsbuild#11479. If it's async, it fundamentally would not be able to use a reference because that reference may no longer be valid - we need to clone the data to fully own it. 

While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than pantsbuild#11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.
# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Eric-Arellano added a commit that referenced this pull request Jun 15, 2021
These metrics were added in #11479 to track what % of writes are actually completing due to the code block being async. But the metrics are useful beyond answering that particular question, and the current names are too specific.

This also changes the "finished" metric to only record successes, whereas before it included errors too. Previously, to get successes, you had to do `remote_cache_write_finished - remote_cache_write_errors`, but now we directly store this. To get back the number "finished", you will now do `remote_cache_write_successes + remote_cache_write_errors`.

[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Async (remote) cache writes

4 participants