
pkg/steps: add build pending timeout period #3427

Merged
openshift-merge-robot merged 4 commits into openshift:master from bbguimaraes:build_pending
May 29, 2023

Conversation

@bbguimaraes
Contributor

Previous attempt: #3390.


This version fixes a couple of problems with the previous implementation:

  • The Build pointer needs to be updated every time a new object is seen.
  • The pending check was omitted when the timer expired, which was very
    unhelpful.

This resulted in builds being rejected if their execution exceeded the pending
timeout period:

$ yq '.images[]|select(.to=="sleep")' config.yaml
{
  "to": "sleep",
  "from": "root",
  "dockerfile_literal": "FROM root\nRUN sleep 1m\n"
}
$ git log -n 1 --oneline
614711426 (HEAD) Merge pull request #3390 from bbguimaraes/build_pending
$ go install ./cmd/ci-operator
$ ci-operator --config config.yaml --pod-pending-timeout 30s --target sleep …

INFO[2023-05-11T12:48:16Z] Created build "sleep"
INFO[2023-05-11T12:48:46Z] build didn't start running within 30s (phase: Running):
$ git log -n 1 --oneline
ef4000f84 (HEAD) pkg/steps: add pod events to build pending error
$ go install ./cmd/ci-operator
$ ci-operator --config config.yaml --pod-pending-timeout 30s --target sleep …

INFO[2023-05-11T12:51:53Z] Building sleep
INFO[2023-05-11T12:51:53Z] Created build "sleep"
INFO[2023-05-11T12:53:34Z] Build sleep succeeded after 1m41s
INFO[2023-05-11T12:53:35Z] Ran for 1m47s

This pull request contains changed versions of the original commits so that it
does not introduce (more) revisions with problems. Here is a range-diff-based
comparison of the previous and current versions, edited for ease of review:

@@ pkg/steps/source.go: func hintsAtInfraReason(logSnippet string) bool {
 	pendingCheck := func() error {
 	        timeout := podClient.GetPendingTimeout()
-	        t0 := ret.Load().CreationTimestamp
 	        select {
 	        case <-pendingCtx.Done():
-	        case <-time.After(time.Until(t0.Add(timeout))):
-	                err := util.PendingBuildError(ctx, podClient, ret.Load())
-	                logrus.Infof(err.Error())
-	                return err
+	        case <-time.After(time.Until(ret.Load().CreationTimestamp.Add(timeout))):
+	                if err := checkPending(ctx, podClient, ret.Load(), timeout, time.Now()); err != nil {
+	                        logrus.Infof(err.Error())
+	                        return err
+	                }
 	        }
 	        return nil
 	}
@@ pkg/steps/source.go: func hintsAtInfraReason(logSnippet string) bool {
 		return kubernetes.WaitForConditionOnObject(ctx, buildClient, ctrlruntimeclient.ObjectKey{Namespace: namespace, Name: name}, &buildapi.BuildList{}, &buildapi.Build{}, func(obj runtime.Object) (bool, error) {
 			build := obj.(*buildapi.Build)
+			first := ret.Swap(build) == nil
 			switch build.Status.Phase {
+			case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
+				if first {
+					eg.Go(pendingCheck)
+				}
 @@ pkg/steps/source.go: func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name
-			default:
-				if ret.Swap(build) == nil {
-				        eg.Go(pendingCheck)
-				}
+func checkPending(
+	ctx context.Context,
+	podClient kubernetes.PodClient,
+	build *buildapi.Build,
+	timeout time.Duration,
+	now time.Time,
+) error {
+	switch build.Status.Phase {
+	case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
+		if build.CreationTimestamp.Add(timeout).Before(now) {
+			return util.PendingBuildError(ctx, podClient, build)
+		}
+	}
+	return nil
+}

@openshift-ci Bot requested review from bear-redhat and jmguzik on May 11, 2023 14:00
@openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on May 11, 2023
@bbguimaraes
Contributor Author

Ready for review, but let's not deploy it this week.

/hold

@openshift-ci Bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on May 11, 2023
@bbguimaraes
Contributor Author

/hold cancel

@openshift-ci Bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on May 18, 2023
@bbguimaraes
Contributor Author

/test e2e

@bbguimaraes force-pushed the build_pending branch 2 times, most recently from 32aa3d0 to 8c3a30d on May 25, 2023 09:51
@danilo-gemoli
Contributor

/assign

Comment thread pkg/steps/source.go
}

func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name string) error {
func waitForBuild(
Contributor

This function is complex: I have read it over and over, and every time I find a new corner case I hadn't taken into account before. We can't merge it without writing down a (decently) detailed description of what it is supposed to do.

Contributor Author

It is as complex as the task it has to perform (and hopefully no more). I've added both documentation and comments detailing everything that is involved in it, based on the questions here.

It may help to contrast it with waitForPodCompletionOrTimeout, on which it is based (a funny fact, since that one performs a much more complex task, and I seem to have gotten that one right but not this one).

Comment thread pkg/steps/source.go
evaluatorFunc := func(obj runtime.Object) (bool, error) {
switch build := obj.(type) {
case *buildapi.Build:
var ret atomic.Pointer[buildapi.Build]
Contributor

The use we make of this pointer is tricky (nasty?): the *buildapi.Build object here could be different from the following line if in the meantime a swap happened due to a change of status. The whole logic is probably correct but is non-trivial. Wouldn't a channel be more appropriate in this case where you are trying to make two goroutines communicate?

Contributor Author

The use we make of this pointer is tricky (nasty?):

The (atomic) pointer provides the latest version of the object as seen by the watch.

The whole logic is probably correct but is non-trivial. Wouldn't a channel be more appropriate in this case where you are trying to make two goroutines communicate?

We could use a channel, but I don't see how that makes things clearer.

  • it would just be a send/receive instead of a store/load
  • it would require an extra boolean variable to know when to start the pending check
  • there is only ever one load from the pointer, so the channel would be single-use

Other than those reasons, this is analogous to the same pattern in the function linked above which watches pods, where the pointer makes things even simpler. I did consider using a channel there, feeding both the watch and the pending check events through it, and processing both streams uniformly. However, those two events are unrelated: we know exactly when and what needs to be performed for the pending check (in both functions), and mixing the treatment in this way makes no sense.

the *buildapi.Build object here could be different from the following line if in the meantime a swap happened due to a change of status.

As desired. The first load is used to determine the point at which the verification needs to be performed, based on the creation timestamp. The second (which happens at a much later point in time) has to perform another load since it must consider the latest version of the object. As mentioned in the description, not doing this was one of the problems in the original implementation.
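
A minimal, self-contained sketch of this store-on-every-event / load-at-check-time pattern (illustrative only, not the PR's actual code; `build`, `watchEvents`, and the timeout value here are made-up stand-ins):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type build struct {
	created time.Time
	phase   string
}

func main() {
	var latest atomic.Pointer[build]
	watchEvents := make(chan *build)
	done := make(chan struct{})

	// Pending check: started once, but it re-loads the pointer when the timer
	// fires so that it judges the *latest* object, not the one seen at start.
	pendingCheck := func(timeout time.Duration) {
		created := latest.Load().created
		<-time.After(time.Until(created.Add(timeout)))
		if b := latest.Load(); b.phase == "New" || b.phase == "Pending" {
			fmt.Printf("build didn't start running within %s (phase: %s)\n", timeout, b.phase)
		}
		close(done)
	}

	go func() {
		for b := range watchEvents {
			// Store on every event; Swap returning nil means this was the
			// first object seen, so the check is started exactly once.
			if latest.Swap(b) == nil {
				go pendingCheck(50 * time.Millisecond)
			}
		}
	}()

	created := time.Now()
	watchEvents <- &build{created: created, phase: "New"}
	watchEvents <- &build{created: created, phase: "Pending"}
	<-done
}
```

The real change uses the same Swap-returns-nil guard to start the pending check exactly once, as shown in the range-diff above.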

Contributor

I tried to quickly modify the code using a channel and it turned out to be even more convoluted, so, again (see the comment below), I am no longer in favor of going down this path.
The main problem was that I had to rebuild your mental model of the function (see here from the Cognitive Load Developer's Handbook), which carries a lot of hidden details and assumptions, so I was trying to reduce the complexity a little, starting by eliminating the communication through the atomic pointer.
Comments were probably the real missing piece; I can now "see" what you intended and feel more relaxed.

Comment thread pkg/steps/source.go
build := obj.(*buildapi.Build)
first := ret.Swap(build) == nil
switch build.Status.Phase {
case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
Contributor

Why

eg.SetLimit(1)
// ...
switch build.Status.Phase {
case buildapi.BuildPhaseNew:
  eg.Go(pendingCheck)
// ...
}

would not be sufficient/correct?

Contributor Author

Can you expand?

  • I don't understand the limit of 1, since the pending check must execute in parallel.
  • We may miss the "new" phase if there is a delay between creating the build and establishing the watch, or if we are a separate execution entirely.
  • This check must be started only once, while we may see several updates in the "new" phase.
  • (If you purposefully did not include the shared pointer) there is no need for a separate get (we used to do this for the pod watch) since we've already received the information from the watch and it is guaranteed to be the latest available.

Contributor
@danilo-gemoli May 29, 2023

eg.SetLimit(1)

My bad, I should have written eg.SetLimit(2) instead, in which case the whole code would look like:

eg.SetLimit(2)
// ...
switch build.Status.Phase {
case buildapi.BuildPhaseNew:
  _ = eg.TryGo(pendingCheck)
// ...
}

but it's maybe more hidden and ambiguous, so no way.
The rest of the explanation makes sense to me, thanks.

Contributor Author

I see, that makes sense. That would be one way of having an extra flag indicating whether it has already started, as I said above.

Comment thread pkg/steps/source.go
return kubernetes.WaitForConditionOnObject(ctx, buildClient, ctrlruntimeclient.ObjectKey{Namespace: namespace, Name: name}, &buildapi.BuildList{}, &buildapi.Build{}, func(obj runtime.Object) (bool, error) {
build := obj.(*buildapi.Build)
first := ret.Swap(build) == nil
switch build.Status.Phase {
Contributor

All right, this is more of a vague consideration than anything else. The --pod-pending-timeout flag states that it controls the:

"Maximum amount of time created containers can spend before the running state"

but this switch does not mention the buildapi.BuildPhaseRunning state anywhere.
This obviously means nothing, but I feel like this logic is a little hidden.

Contributor Author

I agree the argument's description is not ideal; updated.

@bbguimaraes force-pushed the build_pending branch 2 times, most recently from a04c152 to 1399292 on May 26, 2023 14:03
@bbguimaraes
Contributor Author

Wow, we enforce even the format of documentation strings?

This implementation is similar to (and uses the same time period as) the one in
`WaitForPodCompletion` in `pkg/util/pods.go`, although the pending timeout
verification for builds is considerably simpler since only one time point ---
the creation time --- is considered.
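
Roughly, that single-time-point check amounts to the following sketch (illustrative only; the helper name is hypothetical and the real logic lives in checkPending in the diff above):

```go
package main

import (
	"fmt"
	"time"
)

// pendingDeadlineExceeded is a made-up helper: the only time point the build
// pending check needs is the build's creation time plus the timeout.
func pendingDeadlineExceeded(created time.Time, timeout time.Duration, now time.Time) bool {
	return created.Add(timeout).Before(now)
}

func main() {
	created := time.Now().Add(-2 * time.Minute)
	// true: the build is still pending past its deadline.
	fmt.Println(pendingDeadlineExceeded(created, 30*time.Second, time.Now()))
}
```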

---

Example (with a pending timeout of `1s`):

```yaml
resources:
  src:
    requests:
      memory: 1T
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending)
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending):
Found 1 events for Pod src-build:
* 0x : 0/23 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Insufficient memory, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, that the pod didn't tolerate, 6 node(s) had taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, that the pod didn't tolerate.
```

There is an unfortunate layering violation in that we are forced to get to the
build pod through the `pod-name` annotation and examine it.  I could not find a
way to do this through the `Build` object (as is possible for logs, for
example).  A pending build has very little information:

```
status:
  conditions:
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "False"
    type: New
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "True"
    type: Pending
  output: {}
  outputDockerImageReference: image-registry.openshift-image-registry.svc:5000/bbguimaraes0/pipeline:src
  phase: Pending
```

It is only by examining the build pod (which at least is done by the existing
code used for test pods) that the cause can be determined and reported.
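
As an illustration only, a sketch of what that lookup could look like with a plain client-go clientset; the annotation key (`openshift.io/build.pod-name`), the helper name, and the message format below are assumptions, not the PR's exact code:

```go
package main

import (
	"context"
	"fmt"

	buildapi "github.com/openshift/api/build/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// describePendingBuild reaches the build pod via an (assumed) annotation on
// the Build and reports the pod's events to explain why it is still pending.
func describePendingBuild(ctx context.Context, client kubernetes.Interface, build *buildapi.Build) (string, error) {
	// Assumed annotation key; the real constant may differ.
	podName := build.Annotations["openshift.io/build.pod-name"]
	if podName == "" {
		return "", fmt.Errorf("build %s has no pod-name annotation", build.Name)
	}
	events, err := client.CoreV1().Events(build.Namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + podName,
	})
	if err != nil {
		return "", fmt.Errorf("listing events for pod %s: %w", podName, err)
	}
	msg := fmt.Sprintf("Found %d events for Pod %s:\n", len(events.Items), podName)
	for _, e := range events.Items {
		// Approximates the "count x reason: message" format in the log above.
		msg += fmt.Sprintf("* %dx %s: %s\n", e.Count, e.Reason, e.Message)
	}
	return msg, nil
}
```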
@bbguimaraes
Contributor Author

/test e2e

Comment thread pkg/steps/source.go
}

func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name string) error {
func waitForBuildOrTimeout(
Contributor

waitForBuildOrTimeout: from the extra OrTimeout suffix I would expect this function to be the same as waitForBuild but accepting an extra timeout time.Duration argument. Is that reasonable?

Contributor Author

Note that I'm not changing the name, as the diff output might seem to indicate. What is funny is that the original implementation did not have a timeout: it just waited forever, despite its name.

We now do have a timeout parameter, only it's passed implicitly via the podClient (later retrieved by its GetPendingTimeout method). I thought about making it explicit, but it didn't seem necessary since we can get it directly from the client object.
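
A tiny sketch of that implicit-timeout shape (illustrative types only, not the real PodClient interface):

```go
package main

import (
	"fmt"
	"time"
)

// pendingTimeoutSource is a made-up stand-in for the real client interface:
// the timeout travels with the client instead of being a separate argument.
type pendingTimeoutSource interface {
	GetPendingTimeout() time.Duration
}

type fakeClient struct{ timeout time.Duration }

func (c fakeClient) GetPendingTimeout() time.Duration { return c.timeout }

// waitForBuild-style function: no explicit timeout parameter.
func setUpPendingCheck(c pendingTimeoutSource) {
	fmt.Printf("pending check fires after %s\n", c.GetPendingTimeout())
}

func main() {
	setUpPendingCheck(fakeClient{timeout: 30 * time.Second})
}
```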

@danilo-gemoli
Contributor

It looks reasonably good to me.
/lgtm

@openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) on May 29, 2023
@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, danilo-gemoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [bbguimaraes,danilo-gemoli]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bbguimaraes
Contributor Author

We need a /lrgtm command in Prow.

@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2023

@bbguimaraes: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot merged commit 1628e67 into openshift:master on May 29, 2023
@bbguimaraes deleted the build_pending branch on May 29, 2023 15:19