buildctl: Provide --wait option #3586
Hmm, looks like I've run into golangci/golangci-lint#3101 as my |
6e4d3ab to 6bee144
With some guidance, I may be able to implement an integration test for this new behavior, but as it stands I'm not that familiar with the functional/integration tests. Here's an ad-hoc test I ran in minikube with Istio and buildkitd deployed and a DestinationRule that limits max tcp connections to 1. Without this change (2 out of 3 builds fail due to max connections being 1): With this change (no build failures): |
tonistiigi
left a comment
Not sure I understand this. Looks like it just calls ListWorkers twice in a row now, while previously, it called it once. I don't really like introducing extra roundtrips into code paths where performance is important.
tonistiigi
left a comment
We already have support for FailFast that I think is the opposite thing. Iirc, without it the user experience isn't great because when the user makes a configuration error, things will just hang forever, instead of returning a proper error quickly.
Ah, I see. So perhaps the use of If something like a |
Yeah, I think so. |
As this is client side I assume you mean |
Taking a closer look, it looks like As for hanging forever, that's why I used the value of Perhaps I could refactor this PR to be an explicit |
Looking again I believe this is because the context is only used for the dial. That would make sense. |
Even if it hangs for 20-30s it is still a broken experience to the user. They don't know what is going on.
Opt-in flags in |
Thanks! I'll refactor in that direction.
I'll see about providing user feedback when Just to clarify some of the behavior, too. This should not block in the case of outright connection failure. |
6bee144 to 97e3a4f
I was able to refactor as an opt-in for |
@tonistiigi I'm not sure why CI is failing. It appears to be during riscv64 cross compilation which I have no knowledge of. |
There was an issue with some cross-compile library changes. I restarted it. If it doesn't work, you might need to rebase on top of the latest master. |
  --tlskey value         client key
  --tlsdir value         directory containing CA certificate, client certificate, and client key
  --timeout value        timeout backend connection after value seconds (default: 5)
  --wait-for-ready secs  block calls upon transient connection failures for up to the given secs
The text doesn’t seem to match the actual implementation
Needs rebase |
Hey @marxarelli, since you've not updated for a bit, I force-pushed a commit with a slightly different implementation ❤️ Let me know what your thoughts are on it. I started looking into this because there was a merge conflict and we'd like to take this for the upcoming v0.12 release. There are some interesting conflicts with #3740 and #3761, but I then realized we could do some additional refactoring if we wanted (to remove the need for both Essentially, I reworked it to be more similar to how we do this in buildx today, and renamed the option to I think we should also be able to borrow the client-related logic into buildx at some point, instead of needing to do the polling (which should be a nice, easy perf improvement for the remote driver). Let me know what you think! (cc @AkihiroSuda @tonistiigi, if you could as well ❤️) |
97e3a4f to 892dac3
tonistiigi
left a comment
This works, but I guess gRPC has more optimal options for achieving this without the need to actually call a method. Btw, ListWorkers is not actually a very cheap call. Iirc it performs a recheck of all the emulators that are installed to make sure it returns accurate information.
@jedevc, @tonistiigi, I just want to say thank you very much for moving this forward in my absence, and sorry I disappeared like that. If there's anything still needed from me here, I once again have time to dedicate. |
IIRC, in my first iteration of this PR I had looked into performing a gRPC health check instead but last I looked that was only implemented for the gateway and not buildkitd. If that changes in the future, it seems like the better option for a low cost preflight method of ensuring a healthy connection/session. |
@jedevc PTAL #3586 (review) |
Added `--wait` to buildctl's global options. See below for behavior.

Implemented a `Wait` client method that blocks until a successful request has been made to the remote buildkit. This behavior is identical to that in buildx, and only makes additional `ListWorkers` calls *if the user has requested them*.

The timeout as requested using the `--timeout` option is additionally applied here.

Co-authored-by: Justin Chadwell <me@jedevc.com>
Signed-off-by: Justin Chadwell <me@jedevc.com>
892dac3 to ef61bbe
Sorry for the delay, I've updated to use Older BuildKit versions don't support this endpoint, but that's fine - we can just check for the I think a healthcheck might still be a good idea? If we do that, we should switch to that instead, but I'm not convinced we need to add a whole new API right now just for this functionality. |
    Value: 5,
},
cli.BoolFlag{
    Name: "wait",
This isn't consistent with the flag mentioned in the PR title
Personally, I prefer the shorter form wait - I'll update the PR title to match.
Happy to discuss though, I don't have strong opinions.
Thanks to all for taking the time to review and refactor my original contribution. ❤️ It has been much appreciated. |
Provide a `WaitForReady` client method that performs a preflight request and specifies `grpc.WaitForReady(true)` to ensure that the `grpc.ClientConn` has established the underlying connection and that it can be considered available.

Performing this request prior to solves makes the client more robust in environments where the server is behind a proxy or part of a service mesh (e.g. Istio/Envoy). In these environments, connections may be prematurely closed prior to any client requests due to circuit breaking on max connections.

For the moment, this incurs a redundant request to `ListWorkers`, which seemed to be the more backwards compatible request to make as a preflight; `Info` is not available in older versions. If buildkitd ever implements the `grpc.health.v1.Health` directly on its server endpoint, a `grpc.health.v1.Health/Check` may make more sense.

Signed-off-by: Dan Duvall <dduvall@wikimedia.org>