Hi!
Every day we have some builds failing with the following error:
```
#25 exporting cache
#25 preparing build cache for export
#25 preparing build cache for export 73.8s done
#25 writing layer sha256:somesha
#25 78.76 error: failed to authorize: no active session for d64hnddm9jssidwj6f9vv2w37: context deadline exceeded
#25 78.76 retrying in 1s
#25 84.76 error: failed to authorize: no active session for d64hnddm9jssidwj6f9vv2w37: context deadline exceeded
#25 84.76 retrying in 2s
#25 91.76 error: failed to authorize: no active session for d64hnddm9jssidwj6f9vv2w37: context deadline exceeded
#25 91.76 retrying in 4s
#25 writing layer sha256:somesha 27.0s done
#25 100.8 error: failed to authorize: no active session for d64hnddm9jssidwj6f9vv2w37: context deadline exceeded
#25 ERROR: error writing layer blob: failed to authorize: no active session for d64hnddm9jssidwj6f9vv2w37: context deadline exceeded
#26 ** export finalization failed - continuing anyway: error writing layer blob: failed to authorize: no active session for d64hnddm9jssidwj6f9vv2w37: context deadline exceeded
#26 DONE 0.0s
```

(The `** export finalization failed - continuing anyway` line comes from our custom code; see the workaround below.)
It affected 36 builds out of 883 in the past 24 hours (approx. 4%). This is likely related to docker/buildx#456 but I believe it deserves an issue in this project since it’s where the error is coming from.
Workaround
For the time being, we came up with a workaround: ignore errors when exporting the cache images instead of failing our pipelines. We achieved this by adding something like the following snippet around `solver/llbsolver/solver.go` line 241 (at b055d2d), where `exportFinalizationFailure` is a toggle we introduced:

```go
cacheExporterResponse, err = e.Finalize(ctx)
if err != nil {
	if exportFinalizationFailure {
		// Report the failure in the build output and carry on, instead of failing the build.
		inBuilderContext(ctx, j, fmt.Sprintf("** export finalization failed - continuing anyway: %v", err), "", func(_ context.Context, _ session.Group) error { return nil })
	} else {
		return nil, err
	}
}
```
Timeline
The session identifiers are legit. The problem is that the session is already gone when we try to fetch registry credentials, by anywhere from a couple of seconds up to a couple of minutes. Here is an example timeline:
| Time | Where | Event |
| --- | --- | --- |
| 00:14.578Z | `controller.Solve()` | called for job `z63yeomm82pcvp171rd4zyk3q`, session `d64hnddm9jssidwj6f9vv2w37` |
| 00:14.578Z | `controller.Status()` | called for job `z63yeomm82pcvp171rd4zyk3q` |
| 00:14.578Z | `llbsolver.Status()` | tries to retrieve job `z63yeomm82pcvp171rd4zyk3q` and waits |
| 00:14.578Z | `sessionmanager.handleConn()` | session `d64hnddm9jssidwj6f9vv2w37` is created |
| 00:14.578Z | `sessionmanager.handleConn()` | waits for session `d64hnddm9jssidwj6f9vv2w37`'s client context to finish |
| 00:14.579Z | `llbsolver.Solve()` | new job `z63yeomm82pcvp171rd4zyk3q`, session `d64hnddm9jssidwj6f9vv2w37` |
| 00:14.579Z | `llbsolver.Status()` | found job `z63yeomm82pcvp171rd4zyk3q` |
| 00:18.963Z | `authHandler.doBearerAuth()` | fetching token for session `d64hnddm9jssidwj6f9vv2w37` |
| 00:19.317Z | `authHandler.doBearerAuth()` | fetched token for session `d64hnddm9jssidwj6f9vv2w37` |
| 00:36.549Z | `llbsolver.Solve()` | starts export for session `d64hnddm9jssidwj6f9vv2w37`, job `z63yeomm82pcvp171rd4zyk3q` |
| 01:15.139Z | `llbsolver.Solve()` | export done for session `d64hnddm9jssidwj6f9vv2w37`, job `z63yeomm82pcvp171rd4zyk3q` |
| 01:15.139Z | `llbsolver.Solve()` | preparing build cache export for session `d64hnddm9jssidwj6f9vv2w37`, job `z63yeomm82pcvp171rd4zyk3q` |
| 01:29.499Z | `llbsolver.Solve()` | discards job `z63yeomm82pcvp171rd4zyk3q`, session `d64hnddm9jssidwj6f9vv2w37` |
| 01:29.500Z | `sessionmanager.handleConn()` | client context for session `d64hnddm9jssidwj6f9vv2w37` is done: context canceled |
| 01:29.500Z | `sessionmanager.handleConn()` | session `d64hnddm9jssidwj6f9vv2w37` is deleted |
| 06:26.926Z | `llbsolver.Solve()` | finalizing export |
| 06:26.926Z | `authHandler.doBearerAuth()` | fetching token for session `d64hnddm9jssidwj6f9vv2w37` |
| 06:31.927Z | `sessionmanager.Any()` | err: no active session for `d64hnddm9jssidwj6f9vv2w37`: context deadline exceeded |
| 06:37.929Z | `sessionmanager.Any()` | err: no active session for `d64hnddm9jssidwj6f9vv2w37`: context deadline exceeded |
| 06:44.930Z | `sessionmanager.Any()` | err: no active session for `d64hnddm9jssidwj6f9vv2w37`: context deadline exceeded |
| 06:53.932Z | `sessionmanager.Any()` | err: no active session for `d64hnddm9jssidwj6f9vv2w37`: context deadline exceeded |
| 06:53.932Z | `llbsolver.Solve()` | writes the custom line `** export finalization failed - continuing anyway` |

Why the session is gone is not entirely clear, but we're doing a lot of concurrent builds that share part of their stages. I hope I didn't make any mistakes when putting the timeline together; if anything seems weird, don't hesitate to ask for verification or additional traces. We're using version 0.9.0 on linux/amd64.
Any idea what the approach should be to recover from such an error?