pkg/steps: add build pending timeout period #3390
openshift-merge-robot merged 4 commits into openshift:master
Conversation
/hold
```go
type PodClient interface {
	loggingclient.LoggingClient
	PendingTimeout() time.Duration
	GetPendingTimeout() time.Duration
```
If I've understood it correctly, this gets the value from --pod-pending-timeout, and it controls the pod lifecycle regardless of its nature, be it a test pod, a build, etc. Is that enough? Do we need it to be more fine-grained (I mean, having more than one value)?
Yes, that is correct: this is the same period as is used for pods in general.
The causes for scheduling delays are for the most part the same for all types of pods, so I don't see the need to add a separate value for builds.
Other than the scary commit history, which could now be simplified since #3385 has been merged, it looks fine overall.
This implementation is similar to (and uses the same time period as) the one in
`WaitForPodCompletion` in `pkg/util/pods.go`, although the pending timeout
verification for builds is considerably simpler since only one time point ---
the creation time --- is considered.
---
Example (with a pending timeout of `1s`):
```yaml
resources:
src:
requests:
memory: 1T
```
```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending)
```
```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending):
Found 1 events for Pod src-build:
* 0x : 0/23 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Insufficient memory, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, that the pod didn't tolerate, 6 node(s) had taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, that the pod didn't tolerate.
```
There is an unfortunate layering violation in that we are forced to get to the
build pod through the `pod-name` annotation and examine it. I could not find a
way to do this through the `Build` object (as is possible for logs, for
example). A pending build has very little information:
```
status:
conditions:
- lastTransitionTime: "2022-06-17T11:06:41Z"
lastUpdateTime: "2022-06-17T11:06:41Z"
status: "False"
type: New
- lastTransitionTime: "2022-06-17T11:06:41Z"
lastUpdateTime: "2022-06-17T11:06:41Z"
status: "True"
type: Pending
output: {}
outputDockerImageReference: image-registry.openshift-image-registry.svc:5000/bbguimaraes0/pipeline:src
phase: Pending
```
It is only by examining the build pod (which at least is done by the existing
code used for test pods) that the cause can be determined and reported.
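The annotation lookup involved in that layering violation can be sketched as below. Note this is an illustration only: `buildPodName` is a hypothetical helper, and the annotation key constant stands in for the `pod-name` annotation mentioned above rather than reproducing the exact key used in the code.

```go
package main

import "fmt"

// Illustrative key for the build's pod-name annotation; the exact key used
// by the implementation is not reproduced here.
const buildPodNameAnnotation = "pod-name"

// buildPodName resolves the name of the pod backing a build from the
// build's annotations, since a pending Build object itself exposes almost
// no status information.
func buildPodName(annotations map[string]string) (string, error) {
	name, ok := annotations[buildPodNameAnnotation]
	if !ok || name == "" {
		return "", fmt.Errorf("build has no %q annotation", buildPodNameAnnotation)
	}
	return name, nil
}

func main() {
	name, _ := buildPodName(map[string]string{buildPodNameAnnotation: "src-build"})
	fmt.Println(name) // src-build
}
```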
Force-pushed from 8887281 to c890b63 (compare).
Rebased. /hold cancel
Look at the pod pending verification diligently reporting problems, haha. We do seem to be waiting too long to report it (~2m vs. 30s), and that period may need adjustment.
/test e2e
It may also be good to change the error message to include all pending containers; it's not immediately obvious that the scheduling requirements of one of the (non-init) containers are the cause of the problem.
/test e2e
This test is meant to cause a pending timeout using a non-existent image. It works as intended, but it intentionally uses a short timeout period so that it executes quickly. It may happen that the timeout occurs for other reasons (this is, after all, what the actual verification is meant to detect), leading to test failures such as the one seen here: openshift#3390 (comment). Since the reason for the delay is not relevant for this test, only that it be detected and properly reported, this change relaxes the error message searched for in the logs so that any pending situation satisfies the test.
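The relaxed matching described in that commit message could look something like the sketch below. `matchesPendingTimeout` is a hypothetical helper, not the actual e2e test code: it accepts any log line reporting a pending timeout instead of matching the full event text, which varies with the cause of the scheduling delay.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesPendingTimeout reports whether a log line indicates a pending
// timeout, regardless of the underlying reason for the delay (missing
// image, insufficient memory, untolerated taints, ...).
func matchesPendingTimeout(logLine string) bool {
	return strings.Contains(logLine, "didn't start running within")
}

func main() {
	fmt.Println(matchesPendingTimeout("build didn't start running within 1s (phase: Pending)")) // true
	fmt.Println(matchesPendingTimeout("build completed successfully"))                          // false
}
```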
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bbguimaraes, danilo-gemoli. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
#2875
https://issues.redhat.com/browse/DPTP-2836
Based on #3385.
/hold