
pkg/steps: add build pending timeout period #3390

Merged
openshift-merge-robot merged 4 commits into openshift:master from bbguimaraes:build_pending
May 2, 2023

Conversation

@bbguimaraes
Contributor

@bbguimaraes bbguimaraes commented Apr 21, 2023

@openshift-ci openshift-ci Bot requested review from bear-redhat and droslean April 21, 2023 18:56
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2023
@bbguimaraes
Contributor Author

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 24, 2023
Comment thread on `pkg/kubernetes/client.go`:

```diff
 type PodClient interface {
 	loggingclient.LoggingClient
-	PendingTimeout() time.Duration
+	GetPendingTimeout() time.Duration
```
Contributor

If I've understood it correctly, this gets the value from --pod-pending-timeout and it controls a pod's lifecycle regardless of its nature: a test pod, a build, etc. Is it enough? Do we need it to be more fine-grained (I mean, having more than one value)?

Contributor Author

Yes, that is correct; this is the same period as is used for pods in general.

The causes for scheduling delays are for the most part the same for all types of pods, so I don't see the need to add a separate value for builds.

@danilo-gemoli
Contributor

Other than the scary commit history, which could now be simplified since #3385 has been merged, it looks fine overall.

This implementation is similar to (and uses the same time period as) the one in
`WaitForPodCompletion` in `pkg/util/pods.go`, although the pending timeout
verification for builds is considerably simpler since only one time point,
the creation time, is considered.

---

Example (with a pending timeout of `1s`):

```yaml
resources:
  src:
    requests:
      memory: 1T
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending)
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending):
Found 1 events for Pod src-build:
* 0x : 0/23 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Insufficient memory, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, that the pod didn't tolerate, 6 node(s) had taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, that the pod didn't tolerate.
```

There is an unfortunate layering violation in that we are forced to get to the
build pod through the `pod-name` annotation and examine it.  I could not find a
way to do this through the `Build` object (as is possible for logs, for
example).  A pending build has very little information:

```yaml
status:
  conditions:
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "False"
    type: New
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "True"
    type: Pending
  output: {}
  outputDockerImageReference: image-registry.openshift-image-registry.svc:5000/bbguimaraes0/pipeline:src
  phase: Pending
```

It is only by examining the build pod (which at least is done by the existing
code used for test pods) that the cause can be determined and reported.
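A hedged sketch of the lookup described above: read the pod name from the build's annotations and examine that pod directly. The annotation key below is an assumption for illustration; the exact key and client plumbing used in the PR are not shown here.

```go
package main

import (
	"fmt"
)

// podNameAnnotation stands in for the `pod-name` annotation mentioned
// in the text; the exact key is an assumption for this sketch.
const podNameAnnotation = "openshift.io/build.pod-name"

// buildPodName extracts the build pod's name from the Build object's
// annotations, returning false when the annotation is absent.
func buildPodName(annotations map[string]string) (string, bool) {
	name, ok := annotations[podNameAnnotation]
	return name, ok
}

func main() {
	build := map[string]string{podNameAnnotation: "src-build"}
	if name, ok := buildPodName(build); ok {
		// The pod, unlike the pending Build, carries the scheduling
		// events needed to explain the delay.
		fmt.Println("examining pod", name, "to explain the pending build")
	}
}
```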
@bbguimaraes
Contributor Author

Rebased.

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 24, 2023
@bbguimaraes
Contributor Author

Look at the pod pending verification diligently reporting problems, haha:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ci-tools/3390/pull-ci-openshift-ci-tools-master-e2e/1650508305199534080#1:build-log.txt%3A59-64

```
INFO[2023-04-24T15:04:47Z] pod pending for more than 30s: container "ci-scheduling-dns-wait" has not started in 1m58.8714557s:
* Container sidecar is not ready with reason PodInitializing
* Container test is not ready with reason PodInitializing
Found 2 events for Pod pending-pending:
* 14x default-scheduler: 0/43 nodes are available: 16 node(s) had untolerated taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, 2 node(s) had untolerated taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 8 Insufficient cpu, 9 node(s) had untolerated taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}. preemption: 0/43 nodes are available: 35 Preemption is not helpful for scheduling, 8 No preemption victims found for incoming pod.
* 1x default-scheduler: Successfully assigned ci-op-pt7l853v/pending-pending to build04-c92hb-ci-tests-worker-b-5gpfp
```

We do seem to be waiting too long to report it (~2m vs. 30s), and that period may need adjustment (search shows this happens with some frequency), but the failure is not related to this pull request. On the contrary, it shows exactly the types of problems we want to detect.

@bbguimaraes
Contributor Author

/test e2e

@bbguimaraes
Contributor Author

```
INFO[2023-04-24T15:04:47Z] pod pending for more than 30s: container "ci-scheduling-dns-wait" has not started in 1m58.8714557s:
…
```

It may also be good to change the error message to include all pending containers; it's not immediately obvious that the scheduling requirements of one of the (non-init) containers are the cause of the problem.

@bbguimaraes
Contributor Author

/test e2e

bbguimaraes added a commit to bbguimaraes/ci-tools that referenced this pull request Apr 26, 2023
This test is meant to cause a pending timeout using a non-existent image.  It
works as intended, but it intentionally uses a short timeout period so that it
executes quickly.  The timeout may, however, trigger for other reasons (which
is, after all, what the actual verification is meant to detect), leading to
test failures such as the one seen here:

openshift#3390 (comment)

Since the reason for the delay is not relevant for this test, only that it be
detected and properly reported, this change relaxes the error message searched
in the logs such that any pending situation satisfies the test.
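The relaxation described above can be sketched as matching any pending-timeout report in the test logs rather than one specific cause. The patterns are illustrative, not the exact expressions from the change.

```go
package main

import (
	"fmt"
	"regexp"
)

// Before the change, the test required the exact failure cause in the
// log line; now any pending-timeout report satisfies it, since the
// cause of the delay is irrelevant to what the test verifies.
// Patterns are illustrative only.
var anyPending = regexp.MustCompile(`pod pending for more than|didn't start running within`)

func main() {
	log := `INFO[...] build didn't start running within 1s (phase: Pending)`
	fmt.Println("pending timeout detected:", anyPending.MatchString(log))
}
```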
@danilo-gemoli
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2023
@openshift-ci
Contributor

openshift-ci Bot commented May 2, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, danilo-gemoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [bbguimaraes,danilo-gemoli]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

