Skip to content

refactor all gRPC usages to use Tonic instead of grpcio#11307

Merged
tdyas merged 34 commits into
pantsbuild:masterfrom
tdyas:tonic_grpc_refactor
Dec 18, 2020
Merged

refactor all gRPC usages to use Tonic instead of grpcio#11307
tdyas merged 34 commits into
pantsbuild:masterfrom
tdyas:tonic_grpc_refactor

Conversation

@tdyas
Copy link
Copy Markdown
Contributor

@tdyas tdyas commented Dec 14, 2020

Problem

In testing the remote cache code against Toolchain's remote cache cluster (with both an AWS ALB load balancer and nginx ingress), the Pants client was erroring when receiving HTTP/2 GOAWAY frames which are allowed by the HTTP/2 standard in ordinary course of shutting down a connection. The GOAWAY frame should not cause an error when there is no error attached since it is just the server closing the connection, which it is allowed to do. The client should just reconnect.

Pants uses a very old version of the grpcio library (v0.5.x). Thus, this issue could potentially be a bug in grpcio but could have been fixed in a more recent 1.x series release. Long term, we want to switch to Tonic so instead of upgrading grpcio and hoping the issue is fixed, the better use of time is to just port Pants to Tonic.

Solution

Port all gRPC usage in Pants to the Tonic library and the Prost protobuf code generator. This provides idiomatic Rust bindings for gRPC and protobuf and is directly integrated into tokio and hyper.

Result

Existing tests continue to pass.

@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 14, 2020

Caveat:

  • The size of some of the futures is too large and not all tests can run together. Will figure out where to introduce Boxed futures. In the meantime, I set RUST_MIN_STACK in CI to side step the issue.

Comment thread src/rust/engine/fs/store/src/remote.rs
Comment thread src/rust/engine/fs/store/src/remote.rs Outdated
Comment thread src/rust/engine/fs/store/src/remote.rs Outdated
Comment thread src/rust/engine/fs/store/src/remote.rs Outdated
Comment thread src/rust/engine/fs/store/src/remote.rs Outdated
}

#[tokio::test]
// TODO(tonic): Ignored this test because, while `ByteStore` is configured with both endpoints,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tonic did not seem to round robin all the time. This test technically relies on an implementation detail of grpcio.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine - honestly I'd consider (as a follow-up) just taking a single String rather than Vec<String> for a CAS endpoint. Again, the only reason we supported multiple was because Scoot was trying to avoid server-side load balancing.

Digest(fp, size_bytes)
}

// TODO(tonic): Replace use of this method with `.into` or equivalent.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also quite a few places in the code where I open-coded what this method does. We should choose one way or the other. My concern is that digest == None may not be equivalent to digest == Some(EMPTY_DIGEST) in some places.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, digest == None is always an error indicating a non-complaint server :(

Comment thread src/rust/engine/process_execution/src/remote.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote_cache.rs Outdated
Copy link
Copy Markdown
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great! Thanks!

Comment thread src/rust/engine/fs/fs_util/src/main.rs Outdated
.digest
.as_ref()
.map(|d| d.try_into())
.unwrap_or(Ok(EMPTY_DIGEST));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this case actually points at an invalid proto, and we should probably return an error, rather than defaulting to the empty digest? These fields are required, even if that's not enforceable in protobuf schema.

Maybe a helper in the bazel_protos crate:

fn require_digest(digest: Option<bazel_protos::Digest>) -> Result<hashing::Digest, String> {

or an

impl TryFrom<Option<bazel_protos::Digest>> for hashing::Digest

so you could just:

hashing::Digest::try_from(file.digest)?

(And throughout this PR)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try impl TryFrom<Option<bazel_protos::Digest>> for hashing::Digest but it fails due to limitations in std and the current Rust stable versions:

error[E0119]: conflicting implementations of trait `std::convert::TryFrom<std::option::Option<gen::build::bazel::remote::execution::v2::Digest>>` for type `hashing::Digest`:
  --> process_execution/bazel_protos/src/conversions.rs:46:1
   |
46 | impl TryFrom<Option<crate::gen::build::bazel::remote::execution::v2::Digest>> for hashing::Digest {
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: conflicting implementation in crate `core`:
           - impl<T, U> TryFrom<U> for T
             where U: Into<T>;
   = note: upstream crates may add a new impl of trait `std::convert::From<std::option::Option<gen::build::bazel::remote::execution::v2::Digest>>` for type `hashing::Digest` in future versions

error[E0117]: only traits defined in the current crate can be implemented for arbitrary types
  --> process_execution/bazel_protos/src/conversions.rs:46:1
   |
46 | impl TryFrom<Option<crate::gen::build::bazel::remote::execution::v2::Digest>> for hashing::Digest {
   | ^^^^^------------------------------------------------------------------------^^^^^---------------
   | |    |                                                                            |
   | |    |                                                                            `hashing::Digest` is not defined in the current crate
   | |    `std::option::Option` is not defined in the current crate
   | impl doesn't use only types from inside the current crate
   |
   = note: define and implement a trait or new type instead

I could move all of the conversions to hashing to try and avoid the cycle between the crates. But there is a generic implementation for TryFrom that conflicts with any other implementations, and we would apparently need specialization in stable and not nightly to be able to do the Option bit to avoid the catch-all generic impl. rust-lang/rust#50133

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh well, a method is just as usable :) Thanks for trying!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will go with a method, I actually had already introduced one called to_pants_digest_opt as part of these "half one way, half the other way" solutions.

Comment thread src/rust/engine/fs/fs_util/src/main.rs Outdated
.digest
.as_ref()
.map(|d| d.try_into())
.unwrap_or(Ok(EMPTY_DIGEST));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

let bytes = directory
.write_to_bytes()
.map_err(|e| format!("Error serializing directory proto {:?}: {:?}", directory, e))?;
let mut buf = BytesMut::with_capacity(directory.encoded_len());
Copy link
Copy Markdown
Contributor

@illicitonion illicitonion Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like a helper function along the lines of:

fn to_bytes<Message: prost::Message>(m: &Message) -> Bytes {

could be handy?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably better done as a separate PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/rust/engine/fs/store/src/lib.rs Outdated
Comment thread src/rust/engine/fs/store/src/lib.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote_cache.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote_cache.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote_tests.rs Outdated
Comment thread src/rust/engine/process_execution/bazel_protos/src/verification_tests.rs Outdated
Comment thread src/rust/engine/process_execution/bazel_protos/src/verification_tests.rs Outdated
@stuhood
Copy link
Copy Markdown
Member

stuhood commented Dec 15, 2020

@illicitonion : Thanks a lot for the detailed review! I'll wait a round before reviewing.

@Eric-Arellano
Copy link
Copy Markdown
Contributor

and a nice to have fix would be to remove the dead bootstrap of tools used to compile the native gRPC code:

I can tackle this if you'd prefer. I'm still trying to land #11256, which is somewhat related.

@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 16, 2020

Thanks a lot! One blocker, and a nice to have fix would be to remove the dead bootstrap of tools used to compile the native gRPC code:

# Exports PATH, GOROOT, PROTOC, AR.
# shellcheck source=build-support/bin/download_native_build_binaries.sh
source "${REPO_ROOT}/build-support/bin/download_native_build_binaries.sh"
cargo_bin="${CARGO_HOME}/bin/cargo"
if [[ -n "${CARGO_WRAPPER_DEBUG:-}" ]]; then
cat << DEBUG >&2
>>> Executing ${cargo_bin} $@
>>> In ENV:
>>> GOROOT=${GOROOT}
>>> PATH=${PATH}
>>> PROTOC=${PROTOC}
>>> AR=${AR:-<not explicitly set>}
>>>
DEBUG
fi

If you're ok with waiting until after I cut the next dev release (tomorrow morning?), I'd be ok with landing with the stack size workaround in place, as long as we can fix it before the next dev release.

I'm fine with waiting. Gives me time to try and figure out the stack size issue. Frankly, I don't know if it is the remote execution code, the remote store code, or the test servers.

@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 16, 2020

and a nice to have fix would be to remove the dead bootstrap of tools used to compile the native gRPC code:

I can tackle this if you'd prefer. I'm still trying to land #11256, which is somewhat related.

@Eric-Arellano: Yes, please. I am not familiar with what tooling exactly was required for building grpcio other than cmake. So doing the removal in a separate PR would be good.

@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 16, 2020

Partial fix for the stack issue: I removed the with_byte_stream_client and with_cas_client function on ByteStore (storing those clients on the ByteStore directly). This helped fix the stack issue when running tests in fs/store. However, process_execution still has the issue.

@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 17, 2020

Tracked the cause of the stack overflow to

if let Some(workunit_state) = workunit_store::get_workunit_state() {
let store = workunit_state.store;
with_workunit(store, workunit_name, metadata, result_future, |_, md| md).await
} else {
result_future.await
}
}

The stack overflow does not occur if with_workunit is not called with the future in that function. That is, just calling result_future.await is fine, but passing it to with_workunit blows up the stack.

@tdyas tdyas force-pushed the tonic_grpc_refactor branch from cc0527a to 7b3be4a Compare December 17, 2020 05:43
@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 17, 2020

Solved the issue but just boxing the future being passed to with_workunit.

Comment thread src/rust/engine/fs/brfs/src/main.rs Outdated
Comment thread src/rust/engine/fs/store/src/snapshot_ops.rs Outdated
Comment thread src/rust/engine/fs/store/src/snapshot_tests.rs
Comment thread src/rust/engine/process_execution/src/remote_cache.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote_cache.rs Outdated
Comment thread src/rust/engine/process_execution/src/remote_cache.rs Outdated
Comment thread src/rust/engine/testutil/mock/src/action_cache.rs Outdated
Comment thread src/rust/engine/testutil/mock/src/action_cache.rs Outdated
Comment thread src/rust/engine/fs/store/src/snapshot_ops.rs Outdated
Comment thread src/rust/engine/testutil/mock/src/execution_server.rs Outdated
@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 17, 2020

Added some commits to actually get the Tonic-based code when invoked on Toolchain's internal repo: (1) Enter the executor when initializing the scheduler so that Tonic can use tokio::spawn and (2) add HTTP/2 to the ALPN part of the TLS config so that connections actually occur. With those two changes, running this code in remote caching mode against Toolchain's remote cache cluster works.

@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 17, 2020

(Pants needs a better story around integration testing of the remote cache and execution code, or at least a way to have unit/functional tests of the setup/init logic.)

@tdyas
Copy link
Copy Markdown
Contributor Author

tdyas commented Dec 17, 2020

@illicitonion: I made the suggested cleanups for digest conversion. Thanks!

Copy link
Copy Markdown
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hurrah! Thanks!

@tdyas tdyas merged commit 6cc9fd3 into pantsbuild:master Dec 18, 2020
Eric-Arellano added a commit that referenced this pull request Dec 18, 2020
Thanks to #11307, we no longer need to install all these tools.

[ci skip-build-wheels]
tdyas pushed a commit that referenced this pull request Dec 19, 2020
### Problem

#11307 upgraded the gRPC code to [Tonic](https://github.com/hyperium/tonic) from grpcio. As part of that upgrade, Pants now uses [Prost](https://github.com/danburkert/prost) to generate Rust structs for protobuf types. By default, Prost encodes binary fields as `Vec<u8>`. Given the attendant problems of `Vec<u8>` of needing to copy around bytes when structs are cloned, the more efficient method is to use `Bytes` for the representation to avoid unnecessary copies.

### Solution

Upgrade Prost to danburkert/prost@a1cccbc and enable using `Bytes` for all binary fields.

### Result

Existing tests pass.
tdyas pushed a commit that referenced this pull request Dec 19, 2020
Follow-up to #11307 (comment) to add a `to_bytes` helper as an extension function to encode `prost::Message` protobuf types to `Bytes`. Includes roundtrip unit test.
@tdyas tdyas deleted the tonic_grpc_refactor branch February 20, 2023 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants