pkg/steps: add error reason for pending pods #2876

Closed
bbguimaraes wants to merge 3 commits into openshift:master from bbguimaraes:pending_reason


@bbguimaraes
Contributor

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2022
@openshift-ci openshift-ci Bot requested review from hongkailiu and smg247 June 20, 2022 11:59
@openshift-ci
Contributor

openshift-ci Bot commented Jun 20, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2022
This follows the similar implementation in `pkg/steps/template.go` and
uses the same timeout period.

---

Example (with timeout changed to `1s`):

```yaml
resources:
  src:
    requests:
      memory: 1T
```

```
INFO[2022-06-17T18:22:59Z] Building src
INFO[2022-06-17T18:23:05Z] build didn't start running within 1s (phase: Pending):
Found 1 events for Pod src-build:
* 0x : 0/23 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}, that the pod didn't tolerate, 1 node(s) had taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Insufficient memory, 2 node(s) had taint {ci.openshift.io/ci-search: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, that the pod didn't tolerate, 6 node(s) had taint {node-role.kubernetes.io/ci-tests-worker: ci-tests-worker}, that the pod didn't tolerate.
```
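The behavior shown in the log above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `Phase`, `PendingError`, `waitForStart`, the `describe` hook, and the `demo*` constants are all hypothetical names, and the real implementation works with Kubernetes client types and pod events rather than plain callbacks.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Phase is a stand-in for the pod phase type of the Kubernetes API.
type Phase string

const (
	Pending Phase = "Pending"
	Running Phase = "Running"
)

// PendingError (hypothetical) records that a pod never left Pending
// within the timeout, along with whatever details could be collected.
type PendingError struct {
	Timeout time.Duration
	Phase   Phase
	Details string
}

func (e *PendingError) Error() string {
	return fmt.Sprintf("pod didn't start running within %s (phase: %s):\n%s",
		e.Timeout, e.Phase, e.Details)
}

// Short durations so the demonstration finishes quickly.
const (
	demoTimeout  = 50 * time.Millisecond
	demoInterval = 10 * time.Millisecond
)

// waitForStart polls getPhase until the pod leaves Pending or the timeout
// elapses. It returns nil as soon as the phase is anything other than
// Pending; what happens after that is out of scope for this sketch.
// describe is an assumed hook that summarizes the pod's events.
func waitForStart(getPhase func() Phase, describe func() string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		phase := getPhase()
		if phase != Pending {
			return nil
		}
		if !time.Now().Before(deadline) {
			return &PendingError{Timeout: timeout, Phase: phase, Details: describe()}
		}
		time.Sleep(interval)
	}
}

func main() {
	stuck := func() Phase { return Pending }
	events := func() string { return "(pod events would be listed here)" }

	err := waitForStart(stuck, events, demoTimeout, demoInterval)
	var pending *PendingError
	if errors.As(err, &pending) {
		fmt.Println(err)
	}
}
```

Returning a dedicated error type lets callers distinguish a pending timeout from other failures with `errors.As`, which is one way a specific failure reason could be attached later.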

---

There is an unfortunate layering violation here: we are forced to reach
the build pod through the `pod-name` annotation and examine it directly.
I could not find a way to get this information through the `Build`
object (as is possible for logs, for example). A pending build carries
very little information:

```yaml
status:
  conditions:
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "False"
    type: New
  - lastTransitionTime: "2022-06-17T11:06:41Z"
    lastUpdateTime: "2022-06-17T11:06:41Z"
    status: "True"
    type: Pending
  output: {}
  outputDockerImageReference: image-registry.openshift-image-registry.svc:5000/bbguimaraes0/pipeline:src
  phase: Pending
```
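As a rough sketch of the annotation lookup described above: the `Build` struct below is a minimal stand-in rather than the real API type, `podNameForBuild` and `errNoPod` are hypothetical names, and the full annotation key `openshift.io/build.pod-name` is an assumption about what the `pod-name` annotation refers to.

```go
package main

import (
	"errors"
	"fmt"
)

// buildPodNameAnnotation is assumed to be the key under which OpenShift
// records the name of the pod that executes a build.
const buildPodNameAnnotation = "openshift.io/build.pod-name"

// Build is a minimal stand-in holding only the fields this sketch needs.
type Build struct {
	Name        string
	Annotations map[string]string
}

// errNoPod (hypothetical) signals that the build does not reference a pod yet.
var errNoPod = errors.New("build has no pod-name annotation")

// podNameForBuild returns the name of the pod backing the build, so that
// the pod itself can then be fetched and examined.
func podNameForBuild(b Build) (string, error) {
	// Indexing a nil map is safe in Go and yields the zero value.
	name, ok := b.Annotations[buildPodNameAnnotation]
	if !ok || name == "" {
		return "", fmt.Errorf("build %q: %w", b.Name, errNoPod)
	}
	return name, nil
}

func main() {
	b := Build{
		Name:        "src",
		Annotations: map[string]string{buildPodNameAnnotation: "src-build"},
	}
	name, err := podNameForBuild(b)
	if err != nil {
		fmt.Println("cannot inspect pod:", err)
		return
	}
	fmt.Println("pod to inspect:", name)
}
```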

Only by examining the build pod (which at least reuses the existing code
that handles test pods) can the cause be determined. Examples of the
resulting failure reasons:

`src` image:

```
INFO[2022-06-20T11:49:39Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:cloning_source:pod_pending'
```

`images` build:

```
INFO[2022-06-20T11:54:26Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:building_project_image:pod_pending'
```

`tests` container:

```
INFO[2022-06-20T11:56:09Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:running_pod:pod_pending'
```
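The colon-separated reasons in these log lines suggest that each layer contributes one component. The following is a standard-library sketch of that idea, not the reporting code used by the project: `reasonError`, `withReason`, `fullReason`, and `errDemo` are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// reasonError (hypothetical) attaches a short machine-readable reason
// to a wrapped error without changing its message.
type reasonError struct {
	reason string
	err    error
}

func (e *reasonError) Error() string { return e.err.Error() }
func (e *reasonError) Unwrap() error { return e.err }

// withReason wraps err so that reason becomes part of its reason chain.
func withReason(reason string, err error) error {
	return &reasonError{reason: reason, err: err}
}

// errDemo stands in for the underlying failure.
var errDemo = errors.New("pod didn't start running within the timeout")

// fullReason walks the chain from the outermost wrapper inward and joins
// every reason it finds with ':'.
func fullReason(err error) string {
	var reasons []string
	for err != nil {
		if re, ok := err.(*reasonError); ok {
			reasons = append(reasons, re.reason)
		}
		err = errors.Unwrap(err)
	}
	if len(reasons) == 0 {
		return "unknown"
	}
	return strings.Join(reasons, ":")
}

func main() {
	// Innermost layer: the pod never left Pending.
	err := withReason("pod_pending", errDemo)
	// Each outer layer adds its own component as the error propagates.
	err = withReason("cloning_source", err)
	err = withReason("step_failed", err)
	err = withReason("executing_graph", err)

	fmt.Println(fullReason(err)) // prints "executing_graph:step_failed:cloning_source:pod_pending"
}
```

With this shape, only the innermost layer needs to know about pending pods; the outer components stay the same regardless of whether the step was a source clone, an image build, or a test pod.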
@openshift-ci
Contributor

openshift-ci Bot commented Aug 8, 2022

@bbguimaraes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-oo | f458f5a | link | true | `/test e2e-oo` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 7, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 7, 2022
@openshift-merge-robot
Contributor

@bbguimaraes: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 7, 2022
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci Bot closed this Jan 7, 2023
@openshift-ci
Contributor

openshift-ci Bot commented Jan 7, 2023

@openshift-bot: Closed this PR.

Details

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.

3 participants