pkg/steps: add error reason for pending pods #2876
bbguimaraes wants to merge 3 commits into openshift:master
This follows the similar implementation (and uses the same time period)
in pkg/steps/template.go.
---
Example (with timeout changed to `1s`):
```yaml
resources:
  src:
    requests:
      memory: 1T
```
```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending):
Found 1 events for Pod src-build:
* 0x : 0/23 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Insufficient memory, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, that the pod didn't tolerate, 6 node(s) had taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, that the pod didn't tolerate.
```
---
There is an unfortunate layering violation here in that we are forced to
get to the build pod through the `pod-name` annotation and examine it.
I could not find a way to do this through the `Build` object (as is
possible for logs, for example). A pending build has very little
information:
```
status:
  conditions:
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "False"
    type: New
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "True"
    type: Pending
  output: {}
  outputDockerImageReference: image-registry.openshift-image-registry.svc:5000/bbguimaraes0/pipeline:src
  phase: Pending
```
It is only by examining the build pod (reusing the existing code which
handles test pods, at least) that the cause can be determined.
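
As a rough illustration of the lookup described above, the sketch below resolves the pod name from a build's annotations. It uses only the standard library: the `build` type and `podNameForBuild` function are hypothetical, and the annotation key `openshift.io/build.pod-name` is an assumption about the full name of the `pod-name` annotation, not something taken from this PR.

```go
package main

import "fmt"

// buildPodNameAnnotation is assumed to be the full key of the `pod-name`
// annotation that a build carries to identify its pod.
const buildPodNameAnnotation = "openshift.io/build.pod-name"

// build is a simplified, hypothetical stand-in for the Build object.
type build struct {
	Name        string
	Phase       string
	Annotations map[string]string
}

// podNameForBuild returns the name of the pod executing the build, or an
// error if the annotation is missing or empty.
func podNameForBuild(b build) (string, error) {
	name, ok := b.Annotations[buildPodNameAnnotation]
	if !ok || name == "" {
		return "", fmt.Errorf("build %s has no %s annotation", b.Name, buildPodNameAnnotation)
	}
	return name, nil
}

func main() {
	b := build{
		Name:        "src",
		Phase:       "Pending",
		Annotations: map[string]string{buildPodNameAnnotation: "src-build"},
	}
	name, err := podNameForBuild(b)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The pod found this way can then be inspected like a test pod.
	fmt.Println(name)
}
```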
Examples:
`src` image:
```
INFO[2022-06-20T11:49:39Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:cloning_source:pod_pending'
```
`images` build:
```
INFO[2022-06-20T11:54:26Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:building_project_image:pod_pending'
```
`tests` container:
```
INFO[2022-06-20T11:56:09Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:running_pod:pod_pending'
```
https://issues.redhat.com/browse/DPTP-2836
/hold
Includes #2875.