pkg/steps: add build pending timeout period #3427
openshift-merge-robot merged 4 commits into openshift:master
Conversation
|
Ready for review, but let's not deploy it this week. /hold |
|
/hold cancel |
8f00f3e to 629be48
|
/test e2e |
32aa3d0 to 8c3a30d
|
/assign |
```diff
 }

-func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name string) error {
+func waitForBuild(
```
This function is complex, I have read it over and over and every time I find a new corner case I didn't take into account before. We can't merge it without writing down a (decently) detailed description of what it is supposed to do.
It is as complex as the task it has to perform (and hopefully no more). I've added both documentation and comments detailing everything that is involved in it, based on the questions here.
It may help to contrast it with waitForPodCompletionOrTimeout, on which it is based (a funny fact, since that one performs a much more complex task, and I seem to have got that one right but not this one).
```diff
-	evaluatorFunc := func(obj runtime.Object) (bool, error) {
-		switch build := obj.(type) {
-		case *buildapi.Build:
+	var ret atomic.Pointer[buildapi.Build]
```
The use we make of this pointer is tricky (nasty?): the *buildapi.Build object here could be different from the one on the following line if in the meantime a swap happened due to a change of status. The whole logic is probably correct but is non-trivial. Wouldn't a channel be more appropriate in this case, where you are trying to make two goroutines communicate?
The use we make of this pointer is tricky (nasty?):
The (atomic) pointer provides the latest version of the object as seen by the watch.
The whole logic is probably correct but is non-trivial. Wouldn't a channel be more appropriate in this case where you are trying to make two goroutines communicate?
We could use a channel, but I don't see how that makes things clearer.
- it would just be a send/receive instead of a store/load
- it would require an extra boolean variable to know when to start the pending check
- there is only ever one load from the pointer, so the channel would be single-use
Other than those reasons, this is analogous to the same pattern in the function linked above which watches pods, where the pointer makes things even simpler. I did consider using a channel there, feeding both the watch and the pending check events through it, and processing both streams uniformly. However, those two events are unrelated: we know exactly when and what needs to be performed for the pending check (in both functions), and mixing the treatment in this way makes no sense.
the *buildapi.Build object here could be different from the following line if in the meantime a swap happened due to a change of status.
As desired. The first load is used to determine the point at which the verification needs to be performed, based on the creation timestamp. The second (which happens at a much later point in time) has to perform another load since it must consider the latest version of the object. As mentioned in the description, not doing this was one of the problems in the original implementation.
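To make the two loads concrete, here is a stripped-down sketch of the pattern (illustrative only, not the actual pkg/steps code: the toy `build` type, the phase strings and `pendingTimeout` are stand-ins for the real buildapi/podClient types, and the watch is modelled as a plain channel, so unlike the real code a pending-timeout error only surfaces once the build finishes or the channel closes):

```go
package buildwait

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"

	"golang.org/x/sync/errgroup"
)

// Toy stand-in for *buildapi.Build: only the fields the sketch needs.
type build struct {
	created time.Time
	phase   string // "New", "Pending", "Running", "Complete", "Failed"
}

// waitSketch mirrors the shape of the code under review: the watch loop
// stores every update in the atomic pointer, while the pending check
// (started exactly once) sleeps until the deadline derived from the creation
// timestamp and then loads the pointer again to act on the newest object.
func waitSketch(ctx context.Context, updates <-chan build, pendingTimeout time.Duration) error {
	var latest atomic.Pointer[build]
	pendingCtx, cancel := context.WithCancel(ctx)
	defer cancel()
	var eg errgroup.Group
	pendingCheck := func() error {
		// First load: only the creation timestamp matters for the deadline.
		deadline := latest.Load().created.Add(pendingTimeout)
		select {
		case <-pendingCtx.Done():
			return nil
		case <-time.After(time.Until(deadline)):
			// Second load: the watch may have swapped in a newer object.
			if b := latest.Load(); b.phase == "New" || b.phase == "Pending" {
				return fmt.Errorf("build didn't start running within %s (phase: %s)", pendingTimeout, b.phase)
			}
			return nil
		}
	}
	for b := range updates {
		b := b
		first := latest.Swap(&b) == nil // store every update; true only for the first
		switch b.phase {
		case "New", "Pending":
			if first {
				eg.Go(pendingCheck) // start the check exactly once
			}
		case "Running":
			cancel() // started in time; the pending check is no longer needed
		case "Complete", "Failed":
			cancel()
			return eg.Wait() // surfaces a pending-timeout error, if any
		}
	}
	cancel()
	return eg.Wait()
}
```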
I have tried to quickly modify the code using a channel and it turned out to be even more convoluted, so, again (see the comment below), I am no longer in favor of going down this path.
The main problem was that I had to rebuild your mental model of the function (see here, from the Cognitive Load Developer's Handbook), which carries a lot of hidden details and assumptions, so I was trying to reduce the complexity a little bit, starting by eliminating the communication through the atomic pointer.
Comments were probably the real missing piece; I can now "see" what you intended and feel more relaxed.
```diff
 		build := obj.(*buildapi.Build)
+		first := ret.Swap(build) == nil
 		switch build.Status.Phase {
+		case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
```
Why

```go
eg.SetLimit(1)
// ...
switch build.Status.Phase {
case buildapi.BuildPhaseNew:
	eg.Go(pendingCheck)
	// ...
}
```

would not be sufficient/correct?
Can you expand?
- I don't understand the limit of 1, since the pending check must execute in parallel.
- We may miss the "new" phase if there is a delay between creating the build and establishing the watch, or if we are a separate execution entirely.
- This check must be started only once, while we may see several updates in the "new" phase.
- (If you purposefully did not include the shared pointer) there is no need for a separate get (we used to do this for the pod watch) since we've already received the information from the watch and it is guaranteed to be the latest available.
eg.SetLimit(1)

My bad, I should have written eg.SetLimit(2) instead, therefore the whole code would look like

```go
eg.SetLimit(2)
// ...
switch build.Status.Phase {
case buildapi.BuildPhaseNew:
	_ = eg.TryGo(pendingCheck)
	// ...
}
```

but it's maybe more hidden and ambiguous, so no way.
The rest of the explanation makes sense to me, thanks.
I see, that makes sense. That would be one way of having an extra flag indicating whether it has already started, as I said above.
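For reference, a small self-contained illustration of the TryGo behaviour discussed here (all values are made up; only the pending check runs in this group, so a limit of 1 is enough for the demonstration): the limit acts as the "already started" flag, but TryGo would happily start the check again once a previous run has finished, which is part of the ambiguity mentioned above.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	var eg errgroup.Group
	eg.SetLimit(1) // at most one pending check in flight
	pendingCheck := func() error {
		time.Sleep(10 * time.Millisecond) // stand-in for the real check
		return nil
	}
	// Simulate a few watch updates in the "New"/"Pending" phases.
	for _, phase := range []string{"New", "New", "Pending"} {
		started := eg.TryGo(pendingCheck) // false while a check is already running
		fmt.Printf("phase=%s started=%v\n", phase, started)
	}
	_ = eg.Wait()
}
```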
```diff
 	return kubernetes.WaitForConditionOnObject(ctx, buildClient, ctrlruntimeclient.ObjectKey{Namespace: namespace, Name: name}, &buildapi.BuildList{}, &buildapi.Build{}, func(obj runtime.Object) (bool, error) {
 		build := obj.(*buildapi.Build)
+		first := ret.Swap(build) == nil
 		switch build.Status.Phase {
```
All right this is more a vague consideration rather than anything else. The --pod-pending-timeout flag states that it controls the:
"Maximum amount of time created containers can spend before the running state"
but this switch does not even mention the state buildapi.BuildPhaseRunning in any place.
This obviously means nothing, but I feel like this logic is a little bit hidden.
I agree the argument's description is not ideal, updated.
a04c152 to 1399292
|
Wow, we enforce even the format of documentation strings? |
This implementation is similar to (and uses the same time period as) the one in
`WaitForPodCompletion` in `pkg/util/pods.go`, although the pending timeout
verification for builds is considerably simpler since only one point in time ---
the creation time --- is considered.
---
Example (with a pending timeout of `1s`):
```yaml
resources:
src:
requests:
memory: 1T
```
```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending)
```
```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending):
Found 1 events for Pod src-build:
* 0x : 0/23 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Insufficient memory, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, that the pod didn't tolerate, 6 node(s) had taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, that the pod didn't tolerate.
```
There is an unfortunate layering violation in that we are forced to get to the
build pod through the `pod-name` annotation and examine it. I could not find a
way to do this through the `Build` object (as is possible for logs, for
example). A pending build has very little information:
```
status:
conditions:
- lastTransitionTime: "2022-06-17T11:06:41Z"
lastUpdateTime: "2022-06-17T11:06:41Z"
status: "False"
type: New
- lastTransitionTime: "2022-06-17T11:06:41Z"
lastUpdateTime: "2022-06-17T11:06:41Z"
status: "True"
type: Pending
output: {}
outputDockerImageReference: image-registry.openshift-image-registry.svc:5000/bbguimaraes0/pipeline:src
phase: Pending
```
It is only by examining the build pod (which at least is done by the existing
code used for test pods) that the cause can be determined and reported.
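For illustration, this is roughly the kind of lookup described above, written against the plain client-go API (the exact annotation key, the helper name and the error formatting are assumptions here, not necessarily what the PR ends up using through the existing ci-tools helpers):

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	buildapi "github.com/openshift/api/build/v1"
)

// describePendingBuild follows the build -> pod -> events chain described
// above. The annotation key is assumed; the real code goes through the
// existing ci-tools helpers rather than a raw clientset.
func describePendingBuild(ctx context.Context, client kubernetes.Interface, build *buildapi.Build) (string, error) {
	podName := build.Annotations["openshift.io/build.pod-name"] // assumed key
	if podName == "" {
		return "", fmt.Errorf("build %s has no pod-name annotation", build.Name)
	}
	pod, err := client.CoreV1().Pods(build.Namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("could not get build pod %s: %w", podName, err)
	}
	// Scheduling problems (taints, insufficient resources, ...) are reported
	// as events on the pod, not on the Build object.
	events, err := client.CoreV1().Events(pod.Namespace).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.name=%s,involvedObject.kind=Pod", pod.Name),
	})
	if err != nil {
		return "", fmt.Errorf("could not list events for pod %s: %w", pod.Name, err)
	}
	msg := fmt.Sprintf("Found %d events for Pod %s:\n", len(events.Items), pod.Name)
	for _, e := range events.Items {
		msg += fmt.Sprintf("* %dx: %s\n", e.Count, e.Message)
	}
	return msg, nil
}
```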
1399292 to 2965020
|
/test e2e |
```diff
 }

-func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name string) error {
+func waitForBuildOrTimeout(
```
waitForBuildOrTimeout - I would expect from the extra OrTimeout suffix that this function is the same as waitForBuild but accepts an extra timeout time.Duration argument. Is that reasonable?
Note that I'm not changing the name, as the diff output can seem to indicate. What is funny is the original implementation did not have a timeout. It just waited forever, despite its name.
We now do have a timeout parameter, only it's passed implicitly via the podClient (later retrieved by its GetPodPendingTimeout method). I thought about making it explicit, but it didn't seem necessary since we can get it directly from the client object.
|
It looks reasonably good to me. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bbguimaraes, danilo-gemoli. |
|
We need a |
|
@bbguimaraes: all tests passed! |
Previous attempt: #3390.
This version fixes a couple of problems with the previous implementation:

- the `Build` pointer needs to be updated every time a new object is seen
- the error message produced when the pending timeout expired was unhelpful

The first of these resulted in builds being rejected if their execution exceeded the pending timeout period.
This pull request contains changed versions of the original commits so that it
does not introduce (more) revisions with problems. Here is a
range-diff-based comparison of the previous and current versions, edited for ease of review:
```diff
@@ pkg/steps/source.go: func hintsAtInfraReason(logSnippet string) bool {
 	pendingCheck := func() error {
 		timeout := podClient.GetPendingTimeout()
-		t0 := ret.Load().CreationTimestamp
 		select {
 		case <-pendingCtx.Done():
-		case <-time.After(time.Until(t0.Add(timeout))):
-			err := util.PendingBuildError(ctx, podClient, ret.Load())
-			logrus.Infof(err.Error())
-			return err
+		case <-time.After(time.Until(ret.Load().CreationTimestamp.Add(timeout))):
+			if err := checkPending(ctx, podClient, ret.Load(), timeout, time.Now()); err != nil {
+				logrus.Infof(err.Error())
+				return err
+			}
 		}
 		return nil
 	}
@@ pkg/steps/source.go: func hintsAtInfraReason(logSnippet string) bool {
 	return kubernetes.WaitForConditionOnObject(ctx, buildClient, ctrlruntimeclient.ObjectKey{Namespace: namespace, Name: name}, &buildapi.BuildList{}, &buildapi.Build{}, func(obj runtime.Object) (bool, error) {
 		build := obj.(*buildapi.Build)
+		first := ret.Swap(build) == nil
 		switch build.Status.Phase {
+		case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
+			if first {
+				eg.Go(pendingCheck)
+			}
@@ pkg/steps/source.go: func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name
-		default:
-			if ret.Swap(build) == nil {
-				eg.Go(pendingCheck)
-			}
…
+func checkPending(
+	ctx context.Context,
+	podClient kubernetes.PodClient,
+	build *buildapi.Build,
+	timeout time.Duration,
+	now time.Time,
+) error {
+	switch build.Status.Phase {
+	case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
+		if build.CreationTimestamp.Add(timeout).Before(now) {
+			return util.PendingBuildError(ctx, podClient, build)
+		}
+	}
+	return nil
+}
```
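One detail worth noting in the new `checkPending` function: it receives the current time as an argument instead of calling `time.Now` itself, which presumably is what lets the cutoff be exercised deterministically in tests. A hypothetical test along those lines (a fragment, not part of the PR: `fakePodClient` is an assumed stand-in for `kubernetes.PodClient`, and imports and test placement are omitted):

```go
// Hypothetical test; checkPending is the function added by this PR and
// fakePodClient is an assumed stand-in for kubernetes.PodClient.
func TestCheckPendingCutoff(t *testing.T) {
	created := metav1.NewTime(time.Date(2022, 6, 17, 18, 22, 59, 0, time.UTC))
	b := &buildapi.Build{ObjectMeta: metav1.ObjectMeta{CreationTimestamp: created}}
	b.Status.Phase = buildapi.BuildPhasePending
	timeout := time.Second
	ctx := context.Background()
	// Half a second after creation: still within the pending period, no error.
	if err := checkPending(ctx, fakePodClient, b, timeout, created.Add(500*time.Millisecond)); err != nil {
		t.Errorf("unexpected error: %v", err)
	}
	// Six seconds after creation: the pending timeout has expired.
	if err := checkPending(ctx, fakePodClient, b, timeout, created.Add(6*time.Second)); err == nil {
		t.Error("expected a pending-timeout error")
	}
}
```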