
buildctl: Provide --wait option #3586

Merged

tonistiigi merged 1 commit into moby:master from marxarelli:review/client-session-connection-retry on Jun 29, 2023

Conversation

@marxarelli (Contributor)

Provide a WaitForReady client method that performs a preflight request and specifies grpc.WaitForReady(true) to ensure that the grpc.ClientConn has established the underlying connection and that it can be considered available.

Performing this request prior to solves makes the client more robust in environments where the server is behind a proxy or part of a service mesh (e.g. Istio/Envoy). In these environments, connections may be prematurely closed prior to any client requests due to circuit breaking on max connections.

For the moment, this incurs a redundant request to ListWorkers, which seemed the more backwards-compatible request to make as a preflight, since Info is not available in older versions. If buildkitd ever implements grpc.health.v1.Health directly on its server endpoint, a grpc.health.v1.Health/Check may make more sense.
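
For illustration, a minimal sketch of the described preflight, written against buildkit's controlapi package (the exact wiring in the patch may differ):

```go
package client

import (
	"context"

	controlapi "github.com/moby/buildkit/api/services/control"
	"google.golang.org/grpc"
)

// waitForReady issues a cheap preflight RPC with grpc.WaitForReady(true),
// which queues the call while the connection is in TRANSIENT_FAILURE
// rather than failing it immediately. When it returns without error, the
// underlying grpc.ClientConn has an established, usable connection.
func waitForReady(ctx context.Context, c controlapi.ControlClient) error {
	_, err := c.ListWorkers(ctx, &controlapi.ListWorkersRequest{}, grpc.WaitForReady(true))
	return err
}
```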

Signed-off-by: Dan Duvall <dduvall@wikimedia.org>

@marxarelli (Contributor, Author)

Hmm, looks like I've run into golangci/golangci-lint#3101, as my gofmt says everything is A-OK. I'll install Go 1.19 and recheck.

@marxarelli force-pushed the review/client-session-connection-retry branch from 6e4d3ab to 6bee144 on February 6, 2023
@marxarelli (Contributor, Author)

With some guidance, I may be able to implement an integration test for this new behavior, but as it stands I'm not that familiar with the functional/integration tests.

Here's an ad-hoc test I ran in minikube with Istio and buildkitd deployed, and a DestinationRule that limits max TCP connections to 1.

/tmp # cat > Dockerfile
FROM busybox
RUN sleep 3

Without this change (2 out of 3 builds fail due to max connections being 1):

/tmp # seq 1 3 | xargs -n 1 -P 3 buildctl --timeout 60 --addr tcp://buildkitd:1234 build --progress plain --frontend dockerfile.v0 --local context=. --local dockerfile=. --no-cache
error: listing workers for Build: failed to list workers: Unavailable: write tcp 172.17.0.6:51798->10.107.217.79:1234: use of closed network connection
error: listing workers for Build: failed to list workers: Unavailable: write tcp 172.17.0.6:51802->10.107.217.79:1234: use of closed network connection
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 62B done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/busybox:latest
#3 DONE 0.8s

#4 [1/2] FROM docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c
#4 resolve docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c done
#4 CACHED

#5 [2/2] RUN sleep 3
#5 DONE 3.1s

With this change (no build failures):

/tmp # seq 1 3 | xargs -n 1 -P 3 buildctl --timeout 60 --addr tcp://buildkitd:1234 build --progress plain --frontend dockerfile.v0 --local context=. --local dockerfile=. --no-cache
#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 62B 0.0s done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/busybox:latest
#3 DONE 0.7s

#4 [1/2] FROM docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c
#4 resolve docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c 0.0s done
#4 CACHED

#5 [2/2] RUN sleep 3
#5 DONE 3.1s
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 62B done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/busybox:latest
#3 DONE 0.5s

#4 [1/2] FROM docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c
#4 resolve docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c 0.0s done
#4 CACHED

#5 [2/2] RUN sleep 3
#5 DONE 3.1s
#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 62B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/busybox:latest
#3 DONE 0.5s

#4 [1/2] FROM docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c
#4 resolve docker.io/library/busybox@sha256:7b3ccabffc97de872a30dfd234fd972a66d247c8cfc69b0550f276481852627c 0.0s done
#4 CACHED

#5 [2/2] RUN sleep 3
#5 DONE 3.1s

@tonistiigi (Member) left a comment

Not sure I understand this. It looks like it just calls ListWorkers twice in a row now, whereas previously it called it once. I don't really like introducing extra round trips into code paths where performance is important.

@tonistiigi (Member) left a comment

We already have support for FailFast, which I think is the opposite of this. IIRC, without it the user experience isn't great: when the user makes a configuration error, things just hang forever instead of returning a proper error quickly.

@marxarelli (Contributor, Author) commented on Feb 8, 2023

> We already have support for FailFast, which I think is the opposite of this. IIRC, without it the user experience isn't great: when the user makes a configuration error, things just hang forever instead of returning a proper error quickly.

Ah, I see. Perhaps the use of FailFast should be optional, or opt-out via a client CLI flag. There are use cases, like the one I described, where connection failover is much more desirable than quick failure. The grpc.ClientConn already implements connection retry and backoff for transient network failures, and that is valuable behavior to keep when running in a service mesh environment.

If something like a --no-fail-fast option would be an acceptable change, let me know and I will implement it as a new PR. Meanwhile, I will verify whether it would have the same effect as this change.

@tonistiigi (Member)

> If something like a --no-fail-fast option would be an acceptable change

Yeah, I think so.

@tonistiigi (Member)

> If something like a --no-fail-fast option would be an acceptable change

Yeah, I think so.

As this is client-side, I assume you mean buildctl? buildctl is fine, but for something like Docker Buildx we would need to think harder.

@marxarelli (Contributor, Author)

> We already have support for FailFast, which I think is the opposite of this. IIRC, without it the user experience isn't great: when the user makes a configuration error, things just hang forever instead of returning a proper error quickly.

Taking a closer look, it appears that client.WithFailFast() results in the use of grpc.FailOnNonTempDialError(true), which is not the same as disabling fail-fast behavior in the gRPC client connection handler. I've verified that disabling it does not have the same effect as my patch. According to the documentation, grpc.WaitForReady(true) is the right way to accomplish this.

As for hanging forever, that's why I used the value of --timeout to enforce a timeout on the context. I was surprised this wasn't already the case, since the flag already adds a timeout to the client context, but in my tests the timeout wasn't enforced until I applied it to the context of the call in client.WaitForReady().

Perhaps I could refactor this PR to be an explicit --wait-for-ready <timeout> option? This could be added to the default call options, and I could remove client.WaitForReady to avoid the extra round trip.
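
To make the distinction concrete, a sketch using plain google.golang.org/grpc (the insecure credentials are a placeholder for a runnable example; buildctl derives TLS settings from its --tls* flags):

```go
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dial(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, addr,
		// Placeholder credentials for the sake of a self-contained sketch.
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// What WithFailFast maps to: fail the *dial* immediately on
		// non-temporary errors (only meaningful together with WithBlock).
		grpc.WithBlock(),
		grpc.FailOnNonTempDialError(true),
		// The behavior proposed here: queue *calls* while the connection
		// is in TRANSIENT_FAILURE, until it recovers or ctx expires.
		grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
	)
}
```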

@marxarelli (Contributor, Author)

> I was surprised this wasn't already the case, since the flag already adds a timeout to the client context, but in my tests the timeout wasn't enforced until I applied it to the context of the call in client.WaitForReady().

Looking again, I believe this is because the context is only used for the dial. That would make sense.

@tonistiigi (Member)

> As for hanging forever, that's why I used the value of --timeout to enforce a timeout on the context.

Even if it hangs for 20-30s, it is still a broken experience for the user. They don't know what is going on.

> Perhaps I could refactor this PR to be an explicit --wait-for-ready

Opt-in flags in buildctl for these cases SGTM

@marxarelli (Contributor, Author)

> Perhaps I could refactor this PR to be an explicit --wait-for-ready

> Opt-in flags in buildctl for these cases SGTM

Thanks! I'll refactor in that direction.

> As for hanging forever, that's why I used the value of --timeout to enforce a timeout on the context.

> Even if it hangs for 20-30s, it is still a broken experience for the user. They don't know what is going on.

I'll see about providing user feedback when --wait-for-ready is used as well.

Just to clarify the behavior: this should not block in the case of outright connection failure. WaitForReady only takes effect while the connection is in the TRANSIENT_FAILURE state.

@marxarelli force-pushed the review/client-session-connection-retry branch from 6bee144 to 97e3a4f on February 11, 2023
@marxarelli changed the title from "client: Ensure active/healthy connection prior to solve" to "buildctl: Provide --wait-for-ready option" on Feb 11, 2023
@marxarelli (Contributor, Author)

I was able to refactor this as an opt-in for buildctl, and used an interceptor to minimize the changes to the client code.
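
A rough sketch of what such an interceptor can look like (hypothetical, not the literal patch):

```go
package client

import (
	"context"

	"google.golang.org/grpc"
)

// withWaitForReady appends grpc.WaitForReady(true) to every outgoing
// unary RPC, so none of the existing call sites need to change. It is
// installed at dial time via grpc.WithUnaryInterceptor(withWaitForReady).
func withWaitForReady(
	ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption,
) error {
	return invoker(ctx, method, req, reply, cc, append(opts, grpc.WaitForReady(true))...)
}
```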

@marxarelli (Contributor, Author)

@tonistiigi I'm not sure why CI is failing. It appears to be during riscv64 cross-compilation, which I have no knowledge of.

@tonistiigi (Member)

> It appears to be during riscv64 cross-compilation, which I have no knowledge of.

There was an issue with some cross-compile library changes. I restarted it. If it doesn't work, you might need to rebase on top of the latest master.

Comment thread on docs/buildctl.md (outdated):
--tlskey value client key
--tlsdir value directory containing CA certificate, client certificate, and client key
--timeout value timeout backend connection after value seconds (default: 5)
--wait-for-ready secs block calls upon transient connection failures for up to the given secs
Member
The text doesn't seem to match the actual implementation.

@AkihiroSuda (Member)

Needs rebase

@tonistiigi added this to the v0.12.0 milestone on May 22, 2023
@jedevc (Member) commented on May 23, 2023

Hey @marxarelli, since you've not updated for a bit, I force-pushed a commit with a slightly different implementation ❤️ Let me know your thoughts on it. I started looking into this because there was a merge conflict (we'd like to take this for the upcoming v0.12 release, and there were some interesting conflicts with #3740 and #3761), but then realized we could do some additional refactoring if we wanted, to remove the need for --timeout and --wait-for-ready to both take a number of seconds. Maybe I've missed something that you originally wanted from this feature! (Thankfully, GitHub saves the history of changed PRs, so we can always restore it if you don't like it 🎉)

Essentially, I reworked it to be more similar to how we do this in buildx today, and renamed the option to --wait (we'll see, but I think this is a bit cleaner). Instead of adding a new client option, I've added a new client method called Wait, which makes a single ListWorkers call with the grpc.WaitForReady call option, using the context derived from the --timeout option. With that, --wait doesn't need to take a number of seconds; we can pull it from the existing --timeout option entirely.
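
A hypothetical sketch of the buildctl-side wiring (flag names from this PR; the helper and its integration point are illustrative, assuming urfave/cli v1 as buildctl uses):

```go
package main

import (
	"context"
	"time"

	"github.com/moby/buildkit/client"
	"github.com/urfave/cli"
)

// waitIfRequested gates solves behind the new --wait flag, reusing the
// existing --timeout value to bound how long Wait may block.
func waitIfRequested(clicontext *cli.Context, c *client.Client) error {
	if !clicontext.GlobalBool("wait") {
		return nil
	}
	timeout := time.Duration(clicontext.GlobalInt("timeout")) * time.Second
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	// Wait queues a request through transient connection failures until
	// buildkitd responds or the timeout expires.
	return c.Wait(ctx)
}
```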

I think we should also be able to borrow the client-related logic into buildx at some point, instead of needing to do the polling (which should be a nice, easy perf improvement for the remote driver).

Let me know what you think! (cc @AkihiroSuda @tonistiigi, if you could as well ❤️)

@jedevc force-pushed the review/client-session-connection-retry branch from 97e3a4f to 892dac3 on May 23, 2023
@jedevc requested review from AkihiroSuda and tonistiigi on May 23, 2023
@tonistiigi (Member) left a comment

This works, but I guess gRPC has more optimal options for achieving this without the need to actually call a method. Btw, ListWorkers is not actually a very cheap call; IIRC it rechecks all of the installed emulators to make sure it returns accurate information.

@marxarelli (Contributor, Author)

@jedevc, @tonistiigi, I just want to say thank you very much for moving this forward in my absence, and sorry I disappeared like that.

If there's anything still needed from me here, I once again have time to dedicate.

@marxarelli (Contributor, Author)

> This works, but I guess gRPC has more optimal options for achieving this without the need to actually call a method. Btw, ListWorkers is not actually a very cheap call; IIRC it rechecks all of the installed emulators to make sure it returns accurate information.

IIRC, in my first iteration of this PR I looked into performing a gRPC health check instead, but last I looked it was only implemented for the gateway, not buildkitd. If that changes in the future, it seems like the better option for a low-cost preflight to ensure a healthy connection/session.
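
If buildkitd ever serves the standard health service on its control endpoint, the preflight could be as cheap as the following sketch (using google.golang.org/grpc/health/grpc_health_v1; per the discussion above, buildkitd does not implement this today):

```go
package client

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// healthPreflight checks server health over an existing connection,
// queuing through transient failures via WaitForReady.
func healthPreflight(ctx context.Context, conn *grpc.ClientConn) error {
	resp, err := healthpb.NewHealthClient(conn).Check(ctx,
		&healthpb.HealthCheckRequest{}, grpc.WaitForReady(true))
	if err != nil {
		return err
	}
	if resp.GetStatus() != healthpb.HealthCheckResponse_SERVING {
		return fmt.Errorf("buildkitd not serving: %s", resp.GetStatus())
	}
	return nil
}
```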

@tonistiigi (Member)

@jedevc PTAL #3586 (review)

Added `--wait` to buildctl's global options. See below for behavior.

Implemented a `Wait` client method that blocks until a successful
request has been made to the remote buildkitd. This behavior is
identical to buildx's, and additional ListWorkers calls are only made
*if the user has requested them*.

The timeout requested via the `--timeout` option is applied here as
well.

Co-authored-by: Justin Chadwell <me@jedevc.com>
Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc force-pushed the review/client-session-connection-retry branch from 892dac3 to ef61bbe on June 29, 2023
@jedevc (Member) commented on Jun 29, 2023

Sorry for the delay; I've updated to use Info instead, which is a much cheaper call.

Older BuildKit versions don't support this endpoint, but that's fine: we can just check for the Unimplemented gRPC code. Even though it's an error response, it's still a response, so we can move forward assuming the server is up and ready.
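
In code, the described fallback amounts to something like this sketch (using the controlapi Info RPC and the grpc status/codes helpers; not the literal merged implementation):

```go
package client

import (
	"context"

	controlapi "github.com/moby/buildkit/api/services/control"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// wait performs a cheap Info call, queuing through transient connection
// failures. An Unimplemented error still proves the server answered, so
// an older buildkitd that lacks Info is treated as up and ready.
func wait(ctx context.Context, c controlapi.ControlClient) error {
	_, err := c.Info(ctx, &controlapi.InfoRequest{}, grpc.WaitForReady(true))
	if err != nil && status.Code(err) != codes.Unimplemented {
		return err
	}
	return nil
}
```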

I think a healthcheck might still be a good idea? If we add one, we should switch to it, but I'm not convinced we need to add a whole new API right now just for this functionality.

Comment thread on cmd/buildctl/main.go:
Value: 5,
},
cli.BoolFlag{
Name: "wait",
Member
This isn't consistent with the flag mentioned in the PR title.

Member

Personally, I prefer the shorter form wait; I'll update the PR title to match.

Happy to discuss though, I don't have strong opinions.

@jedevc changed the title from "buildctl: Provide --wait-for-ready option" to "buildctl: Provide --wait option" on Jun 29, 2023
@tonistiigi merged commit 76ecda8 into moby:master on Jun 29, 2023
@marxarelli (Contributor, Author)

Thanks to all for taking the time to review and refactor my original contribution. ❤️ It has been much appreciated.
