
pkg/steps: add build pending timeout period #3427

Merged
openshift-merge-robot merged 4 commits into openshift:master from bbguimaraes:build_pending
May 29, 2023

Conversation

@bbguimaraes
Contributor

Previous attempt: #3390.


This version fixes a couple of problems with the previous implementation:

  • The Build pointer needs to be updated every time a new object is seen.
  • The pending check was omitted when the timer expired, which was very
    unhelpful.

This resulted in builds being rejected if their execution exceeded the pending
timeout period:

$ yq '.images[]|select(.to=="sleep")' config.yaml
{
  "to": "sleep",
  "from": "root",
  "dockerfile_literal": "FROM root\nRUN sleep 1m\n"
}
$ git log -n 1 --oneline
614711426 (HEAD) Merge pull request #3390 from bbguimaraes/build_pending
$ go install ./cmd/ci-operator
$ ci-operator --config config.yaml --pod-pending-timeout 30s --target sleep …

INFO[2023-05-11T12:48:16Z] Created build "sleep"
INFO[2023-05-11T12:48:46Z] build didn't start running within 30s (phase: Running):
$ git log -n 1 --oneline
ef4000f84 (HEAD) pkg/steps: add pod events to build pending error
$ go install ./cmd/ci-operator
$ ci-operator --config config.yaml --pod-pending-timeout 30s --target sleep …

INFO[2023-05-11T12:51:53Z] Building sleep
INFO[2023-05-11T12:51:53Z] Created build "sleep"
INFO[2023-05-11T12:53:34Z] Build sleep succeeded after 1m41s
INFO[2023-05-11T12:53:35Z] Ran for 1m47s

This pull request contains changed versions of the original commits so that it
does not introduce (more) revisions with problems. Here is a range-diff-based
comparison of the previous and current versions, edited for ease of review:

@@ pkg/steps/source.go: func hintsAtInfraReason(logSnippet string) bool {
 	pendingCheck := func() error {
 	        timeout := podClient.GetPendingTimeout()
-	        t0 := ret.Load().CreationTimestamp
 	        select {
 	        case <-pendingCtx.Done():
-	        case <-time.After(time.Until(t0.Add(timeout))):
-	                err := util.PendingBuildError(ctx, podClient, ret.Load())
-	                logrus.Infof(err.Error())
-	                return err
+	        case <-time.After(time.Until(ret.Load().CreationTimestamp.Add(timeout))):
+	                if err := checkPending(ctx, podClient, ret.Load(), timeout, time.Now()); err != nil {
+	                        logrus.Infof(err.Error())
+	                        return err
+	                }
 	        }
 	        return nil
 	}
@@ pkg/steps/source.go: func hintsAtInfraReason(logSnippet string) bool {
 		return kubernetes.WaitForConditionOnObject(ctx, buildClient, ctrlruntimeclient.ObjectKey{Namespace: namespace, Name: name}, &buildapi.BuildList{}, &buildapi.Build{}, func(obj runtime.Object) (bool, error) {
 			build := obj.(*buildapi.Build)
+			first := ret.Swap(build) == nil
 			switch build.Status.Phase {
+			case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
+				if first {
+					eg.Go(pendingCheck)
+				}
 @@ pkg/steps/source.go: func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name
-			default:
-				if ret.Swap(build) == nil {
-				        eg.Go(pendingCheck)
-				}
+func checkPending(
+	ctx context.Context,
+	podClient kubernetes.PodClient,
+	build *buildapi.Build,
+	timeout time.Duration,
+	now time.Time,
+) error {
+	switch build.Status.Phase {
+	case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
+		if build.CreationTimestamp.Add(timeout).Before(now) {
+			return util.PendingBuildError(ctx, podClient, build)
+		}
+	}
+	return nil
+}

@openshift-ci Bot requested review from bear-redhat and jmguzik on May 11, 2023 14:00
@openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on May 11, 2023
@bbguimaraes
Contributor Author

Ready for review, but let's not deploy it this week.

/hold

@openshift-ci Bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on May 11, 2023
@bbguimaraes
Contributor Author

/hold cancel

@openshift-ci Bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on May 18, 2023
@bbguimaraes
Contributor Author

/test e2e

@bbguimaraes force-pushed the build_pending branch 2 times, most recently from 32aa3d0 to 8c3a30d on May 25, 2023 09:51
@danilo-gemoli
Contributor

/assign

Comment thread pkg/steps/source.go
}

func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name string) error {
func waitForBuild(
Contributor

This function is complex: I have read it over and over, and every time I find a new corner case I hadn't taken into account before. We can't merge it without writing down a (decently) detailed description of what it is supposed to do.

Contributor Author

It is as complex as the task it has to perform (and hopefully no more). I've added both documentation and comments detailing everything that is involved in it, based on the questions here.

It may help to contrast it with waitForPodCompletionOrTimeout, on which it is based (a funny fact, since that one performs a much more complex task, and I seem to have gotten that one right but not this one).

Comment thread pkg/steps/source.go
evaluatorFunc := func(obj runtime.Object) (bool, error) {
switch build := obj.(type) {
case *buildapi.Build:
var ret atomic.Pointer[buildapi.Build]
Contributor

The use we make of this pointer is tricky (nasty?): the *buildapi.Build object here could be different from the following line if in the meantime a swap happened due to a change of status. The whole logic is probably correct but is non-trivial. Wouldn't a channel be more appropriate in this case where you are trying to make two goroutines communicate?

Contributor Author

The use we make of this pointer is tricky (nasty?):

The (atomic) pointer provides the latest version of the object as seen by the watch.

The whole logic is probably correct but is non-trivial. Wouldn't a channel be more appropriate in this case where you are trying to make two goroutines communicate?

We could use a channel, but I don't see how that makes things clearer.

  • it would just be a send/receive instead of a store/load
  • it would require an extra boolean variable to know when to start the pending check
  • there is only ever one load from the pointer, so the channel would be single-use

Other than those reasons, this is analogous to the same pattern in the function linked above which watches pods, where the pointer makes things even simpler. I did consider using a channel there, feeding both the watch and the pending check events through it, and processing both streams uniformly. However, those two events are unrelated: we know exactly when and what needs to be performed for the pending check (in both functions), and mixing the treatment in this way makes no sense.

the *buildapi.Build object here could be different from the following line if in the meantime a swap happened due to a change of status.

As desired. The first load is used to determine the point at which the verification needs to be performed, based on the creation timestamp. The second (which happens at a much later point in time) has to perform another load since it must consider the latest version of the object. As mentioned in the description, not doing this was one of the problems in the original implementation.
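
A minimal, self-contained sketch of this store-on-every-event / load-at-check-time pattern (illustrative only, not the PR's actual code; `build`, `watchEvents`, and the timeout value here are made-up stand-ins):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type build struct {
	created time.Time
	phase   string
}

func main() {
	var latest atomic.Pointer[build]
	watchEvents := make(chan *build)
	done := make(chan struct{})

	// Pending check: started once, but it re-loads the pointer when the timer
	// fires so that it judges the *latest* object, not the one seen at start.
	pendingCheck := func(timeout time.Duration) {
		created := latest.Load().created
		<-time.After(time.Until(created.Add(timeout)))
		if b := latest.Load(); b.phase == "New" || b.phase == "Pending" {
			fmt.Printf("build didn't start running within %s (phase: %s)\n", timeout, b.phase)
		}
		close(done)
	}

	go func() {
		for b := range watchEvents {
			// Store on every event; Swap returning nil means this was the
			// first object seen, so the check is started exactly once.
			if latest.Swap(b) == nil {
				go pendingCheck(50 * time.Millisecond)
			}
		}
	}()

	created := time.Now()
	watchEvents <- &build{created: created, phase: "New"}
	watchEvents <- &build{created: created, phase: "Pending"}
	<-done
}
```

The real change uses the same Swap-returns-nil guard to start the pending check exactly once, as shown in the range-diff above.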

Contributor

I tried to quickly modify the code using a channel and it turned out to be even more convoluted, so, again (see the comment below), I am no longer in favor of going down this path.
The main problem was that I had to rebuild your mental model of the function (see here from the Cognitive Load Developer's Handbook), which carries a lot of hidden details and assumptions, so I was trying to reduce the complexity a little, starting by eliminating the communication through the atomic pointer.
Comments were probably the real missing piece; I can now "see" what you intended and feel more relaxed.

Comment thread pkg/steps/source.go
build := obj.(*buildapi.Build)
first := ret.Swap(build) == nil
switch build.Status.Phase {
case buildapi.BuildPhaseNew, buildapi.BuildPhasePending:
Contributor

Why

eg.SetLimit(1)
// ...
switch build.Status.Phase {
case buildapi.BuildPhaseNew:
  eg.Go(pendingCheck)
// ...
}

would not be sufficient/correct?

Contributor Author

Can you expand?

  • I don't understand the limit of 1, since the pending check must execute in parallel.
  • We may miss the "new" phase if there is a delay between creating the build and establishing the watch, or if we are a separate execution entirely.
  • This check must be started only once, while we may see several updates in the "new" phase.
  • (If you purposefully did not include the shared pointer) there is no need for a separate get (we used to do this for the pod watch) since we've already received the information from the watch and it is guaranteed to be the latest available.

Contributor
@danilo-gemoli May 29, 2023

eg.SetLimit(1)

My bad, I should have written eg.SetLimit(2) instead, in which case the whole code would look like:

eg.SetLimit(2)
// ...
switch build.Status.Phase {
case buildapi.BuildPhaseNew:
  _ = eg.TryGo(pendingCheck)
// ...
}

but it's maybe more hidden and ambiguous, so no way.
The rest of the explanation makes sense to me, thanks.

Contributor Author

I see, that makes sense. That would be one way of having an extra flag indicating whether it has already started, as I said above.

Comment thread pkg/steps/source.go
return kubernetes.WaitForConditionOnObject(ctx, buildClient, ctrlruntimeclient.ObjectKey{Namespace: namespace, Name: name}, &buildapi.BuildList{}, &buildapi.Build{}, func(obj runtime.Object) (bool, error) {
build := obj.(*buildapi.Build)
first := ret.Swap(build) == nil
switch build.Status.Phase {
Contributor

All right, this is more of a vague consideration than anything else. The --pod-pending-timeout flag states that it controls the:

"Maximum amount of time created containers can spend before the running state"

but this switch does not mention the buildapi.BuildPhaseRunning state anywhere.
This obviously means nothing, but I feel like this logic is a little hidden.

Contributor Author

I agree the argument's description is not ideal; updated.

@bbguimaraes force-pushed the build_pending branch 2 times, most recently from a04c152 to 1399292 on May 26, 2023 14:03
@bbguimaraes
Contributor Author

Wow, we enforce even the format of documentation strings?

This implementation is similar to (and uses the same time period as) the one in
`WaitForPodCompletion` in `pkg/util/pods.go`, although the pending timeout
verification for builds is considerably simpler since only one time point ---
the creation time --- is considered.
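
Roughly, that single-time-point check amounts to the following sketch (illustrative only; the helper name is hypothetical and the real logic lives in checkPending in the diff above):

```go
package main

import (
	"fmt"
	"time"
)

// pendingDeadlineExceeded is a made-up helper: the only time point the build
// pending check needs is the build's creation time plus the timeout.
func pendingDeadlineExceeded(created time.Time, timeout time.Duration, now time.Time) bool {
	return created.Add(timeout).Before(now)
}

func main() {
	created := time.Now().Add(-2 * time.Minute)
	// true: the build is still pending past its deadline.
	fmt.Println(pendingDeadlineExceeded(created, 30*time.Second, time.Now()))
}
```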

---

Example (with a pending timeout of `1s`):

```yaml
resources:
  src:
    requests:
      memory: 1T
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending)
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending):
Found 1 events for Pod src-build:
* 0x : 0/23 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Insufficient memory, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, that the pod didn't tolerate, 6 node(s) had taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, that the pod didn't tolerate.
```

There is an unfortunate layering violation in that we are forced to get to the
build pod through the `pod-name` annotation and examine it.  I could not find a
way to do this through the `Build` object (as is possible for logs, for
example).  A pending build has very little information:

```
status:
  conditions:
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "False"
    type: New
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "True"
    type: Pending
  output: {}
  outputDockerImageReference: image-registry.openshift-image-registry.svc:5000/bbguimaraes0/pipeline:src
  phase: Pending
```

It is only by examining the build pod (which at least is done by the existing
code used for test pods) that the cause can be determined and reported.
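
As an illustration only, a sketch of what that lookup could look like with a plain client-go clientset; the annotation key (`openshift.io/build.pod-name`), the helper name, and the message format below are assumptions, not the PR's exact code:

```go
package main

import (
	"context"
	"fmt"

	buildapi "github.com/openshift/api/build/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// describePendingBuild reaches the build pod via an (assumed) annotation on
// the Build and reports the pod's events to explain why it is still pending.
func describePendingBuild(ctx context.Context, client kubernetes.Interface, build *buildapi.Build) (string, error) {
	// Assumed annotation key; the real constant may differ.
	podName := build.Annotations["openshift.io/build.pod-name"]
	if podName == "" {
		return "", fmt.Errorf("build %s has no pod-name annotation", build.Name)
	}
	events, err := client.CoreV1().Events(build.Namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + podName,
	})
	if err != nil {
		return "", fmt.Errorf("listing events for pod %s: %w", podName, err)
	}
	msg := fmt.Sprintf("Found %d events for Pod %s:\n", len(events.Items), podName)
	for _, e := range events.Items {
		// Approximates the "count x reason: message" format in the log above.
		msg += fmt.Sprintf("* %dx %s: %s\n", e.Count, e.Reason, e.Message)
	}
	return msg, nil
}
```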
@bbguimaraes
Contributor Author

/test e2e

Comment thread pkg/steps/source.go
}

func waitForBuild(ctx context.Context, buildClient BuildClient, namespace, name string) error {
func waitForBuildOrTimeout(
Contributor

waitForBuildOrTimeout: from the extra OrTimeout suffix I would expect this function to be the same as waitForBuild but accepting an extra timeout time.Duration argument. Is that reasonable?

Contributor Author

Note that I'm not changing the name, as the diff output might seem to indicate. What is funny is that the original implementation did not have a timeout: it just waited forever, despite its name.

We now do have a timeout parameter, only it's passed implicitly via the podClient (later retrieved by its GetPendingTimeout method). I thought about making it explicit, but it didn't seem necessary since we can get it directly from the client object.
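
A tiny sketch of that implicit-timeout shape (illustrative types only, not the real PodClient interface):

```go
package main

import (
	"fmt"
	"time"
)

// pendingTimeoutSource is a made-up stand-in for the real client interface:
// the timeout travels with the client instead of being a separate argument.
type pendingTimeoutSource interface {
	GetPendingTimeout() time.Duration
}

type fakeClient struct{ timeout time.Duration }

func (c fakeClient) GetPendingTimeout() time.Duration { return c.timeout }

// waitForBuild-style function: no explicit timeout parameter.
func setUpPendingCheck(c pendingTimeoutSource) {
	fmt.Printf("pending check fires after %s\n", c.GetPendingTimeout())
}

func main() {
	setUpPendingCheck(fakeClient{timeout: 30 * time.Second})
}
```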

@danilo-gemoli
Contributor

It looks reasonably good to me.
/lgtm

@openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) on May 29, 2023
@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, danilo-gemoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [bbguimaraes,danilo-gemoli]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bbguimaraes
Contributor Author

We need a /lrgtm command in Prow.

@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2023

@bbguimaraes: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot merged commit 1628e67 into openshift:master on May 29, 2023
@bbguimaraes deleted the build_pending branch on May 29, 2023 15:19