
pkg/steps: add build pending timeout period #3390

Merged
openshift-merge-robot merged 4 commits into openshift:master from bbguimaraes:build_pending
May 2, 2023

Conversation

@bbguimaraes
Contributor

@bbguimaraes bbguimaraes commented Apr 21, 2023

@openshift-ci openshift-ci Bot requested review from bear-redhat and droslean April 21, 2023 18:56
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2023
@bbguimaraes
Contributor Author

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 24, 2023
Comment thread on `pkg/kubernetes/client.go`:

```diff
 type PodClient interface {
 	loggingclient.LoggingClient
-	PendingTimeout() time.Duration
+	GetPendingTimeout() time.Duration
```
Contributor

If I've understood it correctly, this gets the value from --pod-pending-timeout and it controls a pod's lifecycle regardless of its nature: a test pod, a build, etc. Is it enough? Do we need it to be more fine-grained (I mean, having more than one value)?

Contributor Author

Yes, that is correct; this is the same period as is used for pods in general.

The causes for scheduling delays are for the most part the same for all types of pods, so I don't see the need to add a separate value for builds.

@danilo-gemoli
Contributor

Other than the scary commit history, which could now be simplified since #3385 has been merged, it looks fine overall.

This implementation is similar to (and uses the same time period as) the one in
`WaitForPodCompletion` in `pkg/util/pods.go`, although the pending timeout
verification for builds is considerably simpler since only one time point,
the creation time, is considered.

---

Example (with a pending timeout of `1s`):

```yaml
resources:
  src:
    requests:
      memory: 1T
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending)
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending):
Found 1 events for Pod src-build:
* 0x : 0/23 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Insufficient memory, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, that the pod didn't tolerate, 6 node(s) had taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, that the pod didn't tolerate.
```

There is an unfortunate layering violation in that we are forced to get to the
build pod through the `pod-name` annotation and examine it.  I could not find a
way to do this through the `Build` object (as is possible for logs, for
example).  A pending build has very little information:

```yaml
status:
  conditions:
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "False"
    type: New
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "True"
    type: Pending
  output: {}
  outputDockerImageReference: image-registry.openshift-image-registry.svc:5000/bbguimaraes0/pipeline:src
  phase: Pending
```

It is only by examining the build pod (which at least is done by the existing
code used for test pods) that the cause can be determined and reported.
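A hedged sketch of the lookup described above: read the pod name from the build's annotations and examine that pod directly. The annotation key below is an assumption for illustration; the exact key and client plumbing used in the PR are not shown here.

```go
package main

import (
	"fmt"
)

// podNameAnnotation stands in for the `pod-name` annotation mentioned
// in the text; the exact key is an assumption for this sketch.
const podNameAnnotation = "openshift.io/build.pod-name"

// buildPodName extracts the build pod's name from the Build object's
// annotations, returning false when the annotation is absent.
func buildPodName(annotations map[string]string) (string, bool) {
	name, ok := annotations[podNameAnnotation]
	return name, ok
}

func main() {
	build := map[string]string{podNameAnnotation: "src-build"}
	if name, ok := buildPodName(build); ok {
		// The pod, unlike the pending Build, carries the scheduling
		// events needed to explain the delay.
		fmt.Println("examining pod", name, "to explain the pending build")
	}
}
```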
@bbguimaraes
Contributor Author

Rebased.

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 24, 2023
@bbguimaraes
Contributor Author

Look at the pod pending verification diligently reporting problems, haha:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ci-tools/3390/pull-ci-openshift-ci-tools-master-e2e/1650508305199534080#1:build-log.txt%3A59-64

```
INFO[2023-04-24T15:04:47Z] pod pending for more than 30s: container "ci-scheduling-dns-wait" has not started in 1m58.8714557s:
* Container sidecar is not ready with reason PodInitializing
* Container test is not ready with reason PodInitializing
Found 2 events for Pod pending-pending:
* 14x default-scheduler: 0/43 nodes are available: 16 node(s) had untolerated taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, 2 node(s) had untolerated taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 8 Insufficient cpu, 9 node(s) had untolerated taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}. preemption: 0/43 nodes are available: 35 Preemption is not helpful for scheduling, 8 No preemption victims found for incoming pod.
* 1x default-scheduler: Successfully assigned ci-op-pt7l853v/pending-pending to build04-c92hb-ci-tests-worker-b-5gpfp
```

We do seem to be waiting too long to report it (~2m vs. 30s), and that period may need adjustment (search shows this happens with some frequency), but the failure is not related to this pull request. On the contrary, it shows exactly the types of problems we want to detect.

@bbguimaraes
Contributor Author

/test e2e

@bbguimaraes
Contributor Author

```
INFO[2023-04-24T15:04:47Z] pod pending for more than 30s: container "ci-scheduling-dns-wait" has not started in 1m58.8714557s:
…
```

It may also be good to change the error message to include all pending containers; it's not immediately obvious that the scheduling requirements of one of the (non-init) containers are the cause of the problem.

@bbguimaraes
Contributor Author

/test e2e

bbguimaraes added a commit to bbguimaraes/ci-tools that referenced this pull request Apr 26, 2023
This test is meant to cause a pending timeout using a non-existent image.  It
works as intended, but it intentionally uses a short timeout period so that it
executes quickly.  The timeout may, however, trigger for other reasons (which
is, after all, what the actual verification is meant to detect), leading to
test failures such as the one seen here:

openshift#3390 (comment)

Since the reason for the delay is not relevant for this test, only that it be
detected and properly reported, this change relaxes the error message searched
in the logs such that any pending situation satisfies the test.
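The relaxation described above can be sketched as matching any pending-timeout report in the test logs rather than one specific cause. The patterns are illustrative, not the exact expressions from the change.

```go
package main

import (
	"fmt"
	"regexp"
)

// Before the change, the test required the exact failure cause in the
// log line; now any pending-timeout report satisfies it, since the
// cause of the delay is irrelevant to what the test verifies.
// Patterns are illustrative only.
var anyPending = regexp.MustCompile(`pod pending for more than|didn't start running within`)

func main() {
	log := `INFO[...] build didn't start running within 1s (phase: Pending)`
	fmt.Println("pending timeout detected:", anyPending.MatchString(log))
}
```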
@danilo-gemoli
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2023
@openshift-ci
Contributor

openshift-ci Bot commented May 2, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, danilo-gemoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [bbguimaraes,danilo-gemoli]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

