Make remote cache writes be async #11479
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
- let (local_runner, local_runner_call_counter) = create_local_runner(1, 20);
+ let (local_runner, local_runner_call_counter) = create_local_runner(1, 100);
I bumped these to reduce the risk of flaky tests. I'd rather have slightly slower tests than flaky tests.
let command_runner = self.clone();
let result = result.clone();
Thanks @stuhood for helping me to figure out why we needed to do this to avoid an issue with non-static lifetimes in the future!
Lmk if I should comment these two lines.
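For readers wondering why the two clones are needed, here is a minimal sketch of the same pattern. It uses `std::thread::spawn` as a stand-in for the executor's spawn (both require a `'static` task), and `CommandRunner`, `ProcessResult`, and `write_to_cache` are hypothetical simplifications rather than the real types in this PR.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-ins for the real types; only the cloning pattern matters.
#[derive(Clone)]
struct ProcessResult {
    stdout: String,
}

#[derive(Clone)]
struct CommandRunner {
    cache_name: Arc<String>,
}

impl CommandRunner {
    fn write_to_cache(&self, result: &ProcessResult) -> String {
        format!("{} <- {}", self.cache_name, result.stdout)
    }

    // Borrowing `self` or `result` inside the spawned closure would not
    // compile, because `thread::spawn` requires the closure to be `'static`.
    // Cloning first gives the background task its own owned copies.
    fn spawn_cache_write(&self, result: &ProcessResult) -> thread::JoinHandle<String> {
        let command_runner = self.clone();
        let result = result.clone();
        thread::spawn(move || command_runner.write_to_cache(&result))
    }
}
```

The clones are cheap here because the runner only holds an `Arc`; whether that holds for the real command runner is an assumption.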
Hm, this causes
Probably because spawning the future via a direct call to pants/src/rust/engine/task_executor/src/lib.rs Lines 163 to 174 in 9277842 pants/src/rust/engine/task_executor/src/lib.rs Lines 99 to 107 in 9277842
I'm going to bow out of this one - I have no specific knowledge of these bits like Tom does.
Even then, I expect that that test will be flaky, because the session is (and should be) allowed to exit before those writes complete in a way that could be logged to the foreground.
So I tried to use I'm trying to understand the error. The SIGABRT is saying that
We will need Pantsd to be used for remote cache writes to work properly, post #11479. Generally, Pantsd has seen enough major improvements that this is a good change to make either way. [ci skip-rust] [ci skip-build-wheels]
Thanks Tom for the help! [ci skip-build-wheels]
I ran the test 65 times on my M1 and it didn't flake. Note that
log::warn!("Failed to write to remote cache: {}", err);
context
  .workunit_store
  .increment_counter(Metric::RemoteCacheWriteErrors, 1);
Should not block this PR, but we should make sure cache errors still show up in stats when increments are "backgrounded" like this.
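One way to sanity-check this concern is a small sketch in which the error counter is shared with the background task and read back after the task finishes. This is an illustration only: `Counters`, `write_to_remote`, and `spawn_write` are hypothetical names, and a plain thread stands in for the spawned future.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical metrics holder; the real workunit store is more involved.
#[derive(Default)]
struct Counters {
    remote_cache_write_errors: AtomicU64,
}

// Pretend remote write that fails for empty payloads.
fn write_to_remote(payload: &[u8]) -> Result<(), String> {
    if payload.is_empty() {
        Err("empty payload".to_string())
    } else {
        Ok(())
    }
}

// Run the write in the background and record failures on the shared counters.
fn spawn_write(counters: Arc<Counters>, payload: Vec<u8>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        if let Err(_err) = write_to_remote(&payload) {
            counters
                .remote_cache_write_errors
                .fetch_add(1, Ordering::Relaxed);
        }
    })
}
```

In this sketch the increment is only guaranteed to be visible once the task has been joined. If stats are collected before the background work completes, the error would be missing from them, which is the risk raised above.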
    Self::new_with_delays(0, 0)
  }

  pub fn new_with_delays(read_delay_ms: u64, write_delay_ms: u64) -> Result<Self, String> {
Idiomatic Rust would generally just call this `with_delays`.
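A rough sketch of what that naming would look like, in the spirit of `Vec::with_capacity` in the standard library. `StubStore` and its fields are hypothetical; only the two constructor signatures mirror the snippet under review.

```rust
use std::time::Duration;

// Hypothetical test double; type and field names are illustrative only.
#[derive(Debug, PartialEq)]
struct StubStore {
    read_delay: Duration,
    write_delay: Duration,
}

impl StubStore {
    // The plain constructor delegates to the configurable one.
    fn new() -> Result<Self, String> {
        Self::with_delays(0, 0)
    }

    // `with_*` names a constructor that takes extra configuration.
    fn with_delays(read_delay_ms: u64, write_delay_ms: u64) -> Result<Self, String> {
        Ok(StubStore {
            read_delay: Duration::from_millis(read_delay_ms),
            write_delay: Duration::from_millis(write_delay_ms),
        })
    }
}
```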
Closes #11908. As described there, `ensure_remote_has_recursive()` was blocking due to its call to `executor.block_on()`. This was introduced in #9793, which reasoned that only the spawned thread would get blocked, which would be safe. It turns out, however, that this blocking stops Pants from doing anything else. The problem was not introduced by the upgrade from Tokio 0.2 to 1.x, and it's plausible this never worked as intended with remote caching, given that remote caching was not available in May 2020.

Fundamentally, the issue makes sense. Reading from the LMDB Store must be synchronous to be safe, which we correctly expressed, e.g. by using `Executor.spawn_blocking()`. We tried to minimize memory consumption by giving a callback access to a reference/slice of the bytes, rather than cloning them. However, we want the remote code to be async, e.g. so that it can finish in the background a la #11479. Async code fundamentally cannot use that reference, because the reference may no longer be valid by the time the code runs; we need to clone the data to fully own it.

While cloning the bytes will result in more memory consumption, it is imperative that remote caching fails gracefully. The increase in memory consumption is less offensive than #11908, i.e. that slowness in remote cache writes can slow down and even hang Pants.
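The trade-off between the callback style and the owned style can be sketched as follows. This is a simplified illustration under stated assumptions: `LocalStore` is a hypothetical in-memory stand-in for the LMDB store, `load_bytes_with` and `load_bytes` are made-up method names, and a thread stands in for the async upload.

```rust
use std::collections::HashMap;
use std::thread;

// Hypothetical in-memory stand-in for the local LMDB store.
struct LocalStore {
    entries: HashMap<String, Vec<u8>>,
}

impl LocalStore {
    // Callback style: the slice is only valid while the closure runs,
    // so all work on it has to finish synchronously.
    fn load_bytes_with<T>(&self, key: &str, f: impl FnOnce(&[u8]) -> T) -> Option<T> {
        self.entries.get(key).map(|bytes| f(bytes.as_slice()))
    }

    // Owned style: cloning costs memory, but the caller can move the
    // bytes into work that outlives the store borrow.
    fn load_bytes(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }
}

// Background "upload": takes ownership, so it never touches the store again.
// It returns the number of bytes it was given.
fn upload_in_background(bytes: Vec<u8>) -> thread::JoinHandle<usize> {
    thread::spawn(move || bytes.len())
}
```

Passing the `&[u8]` from `load_bytes_with` into `upload_in_background` would be rejected by the compiler, which is the same constraint the commit message describes for the async remote code.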
These metrics were added in #11479 to track what % of writes are actually completing due to the code block being async. But the metrics are useful beyond answering that particular question, and the current names are too specific. This also changes the "finished" metric to only record successes, whereas before it included errors too. Previously, to get successes, you had to do `remote_cache_write_finished - remote_cache_write_errors`, but now we directly store this. To get back the number "finished", you will now do `remote_cache_write_successes + remote_cache_write_errors`. [ci skip-build-wheels]
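The arithmetic described above can be written out as a small sketch. The struct is hypothetical; `successes` and `errors` correspond to `remote_cache_write_successes` and `remote_cache_write_errors`, while the `started` field is an assumed name for the "write started" counter, whose actual metric name is not given here.

```rust
// Hypothetical snapshot of the counters after a run.
struct RemoteCacheWriteStats {
    started: u64,   // assumed name for the "write started" counter
    successes: u64, // remote_cache_write_successes
    errors: u64,    // remote_cache_write_errors
}

impl RemoteCacheWriteStats {
    // The old "finished" value is now derived instead of stored.
    fn finished(&self) -> u64 {
        self.successes + self.errors
    }

    // Writes that started but neither succeeded nor failed, for example
    // because the process exited before they completed.
    fn unfinished(&self) -> u64 {
        self.started.saturating_sub(self.finished())
    }
}
```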
Closes #11434. This should keep cache writes from noticeably slowing down Pants.
This does add a new risk that some cache writes will not happen, particularly when Pantsd is not in use. To assess the problem, we add new metrics for when a cache write starts and finishes. If it becomes a serious problem, we can revisit this, including possibly requiring that Pantsd be used for remote cache writes.