ci-operator: retry infra-failed builds immediately by petr-muller · Pull Request #2659 · openshift/ci-tools

petr-muller · 2022-02-09T13:31:02Z

The previous version caused `resourceVersion should not be set on objects to be created` errors and had to be reverted. This PR fixes that bug in a new commit. I also included the error handling change from #2655.

`Create` populates the `Build`'s metadata with the resource version, UID, etc. of the created object, so the same object cannot be passed to a subsequent `Create` when we need to retry.

The fix is to make a local deep copy of the input (desired) `Build` for each attempt and create the copy instead of the original. The `Build` was passed into the method as a pointer, which made me wonder whether some call site actually depends on the side effect on the object, but none does. To prevent similar confusion in the future, I changed `handleBuild` to accept a copy of the `Build`, not a pointer. If something later needs the resulting created `Build` object, the method should return it.
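For readers who want to see the failure mode in isolation, here is a minimal sketch of why reusing the object breaks on retry and why a deep copy avoids it. It is not the ci-operator code: the `Build` struct, `fakeClient`, and `createTwice` are hypothetical stand-ins that only imitate the two behaviours described above (a create call that writes metadata back into its argument and rejects objects that already have a resource version).

```go
package main

import (
	"errors"
	"fmt"
)

// Build is a hypothetical, trimmed-down stand-in for the real Build type.
type Build struct {
	Name            string
	ResourceVersion string
}

// DeepCopy returns an independent copy of the Build.
func (b *Build) DeepCopy() *Build {
	c := *b
	return &c
}

// fakeClient imitates the behaviour described above: Create rejects objects
// that already carry a resource version, and it writes server-side metadata
// back into the object it was given. This is an assumption-level model, not
// the real client API.
type fakeClient struct{ created int }

func (c *fakeClient) Create(b *Build) error {
	if b.ResourceVersion != "" {
		return errors.New("resourceVersion should not be set on objects to be created")
	}
	c.created++
	b.ResourceVersion = fmt.Sprintf("%d", c.created)
	return nil
}

// createTwice submits the desired Build two times, as a retry would.
// With copyFirst=false the second Create reuses the mutated object and fails.
// Taking desired by value also keeps the caller's object untouched.
func createTwice(c *fakeClient, desired Build, copyFirst bool) error {
	for attempt := 0; attempt < 2; attempt++ {
		obj := &desired
		if copyFirst {
			obj = desired.DeepCopy()
		}
		if err := c.Create(obj); err != nil {
			return fmt.Errorf("attempt %d: %w", attempt+1, err)
		}
	}
	return nil
}

func main() {
	desired := Build{Name: "src"}
	fmt.Println("reusing object:", createTwice(&fakeClient{}, desired, false))
	fmt.Println("deep copies:   ", createTwice(&fakeClient{}, desired, true))
}
```

In this model the first call fails on its second attempt, while the second call succeeds both times because every attempt starts from a pristine copy.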

ci-operator was already able to recognize infrastructure-failed builds
from previous runs and retry them. This is an attempt to reuse that code
to retry such failed builds immediately, with two attempts in an
exponential backoff. The backoff has an intentionally long starting
delay of 1 minute to give the infrastructure problem a chance to go
away. The way the code is structured makes it less optimal for the case
where we are retrying infra failures from the previous executions: it
will eat one of the backoff iterations, but such cases should be rare
because ci-op runs should not result in failures caused by
infrastructure failures anymore (because they are retried immediately).

/cc @openshift/test-platform @bbguimaraes @jupierce
/label tide/merge-method-squash

bbguimaraes · 2022-02-09T13:53:28Z

It's very sad that the existing code:

- discards intermediate results
- uses polling
- has no tests

But none of those are new or need to block this PR.

/lgtm

bbguimaraes · 2022-02-09T13:53:43Z

Maybe someday Go will also advance to 1980s C/++ state of the art and adopt const pointers.

openshift-ci · 2022-02-09T13:53:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bbguimaraes,petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

petr-muller · 2022-02-09T14:07:08Z

/test e2e-oo

petr-muller · 2022-02-09T14:28:33Z

/override e2e-oo

openshift-ci · 2022-02-09T14:29:01Z

@petr-muller: /override requires a failed status context or a job name to operate on.
The following unknown contexts were given:

e2e-oo

Only the following contexts were expected:

ci/prow/breaking-changes
ci/prow/checkconfig
ci/prow/codegen
ci/prow/e2e
ci/prow/e2e-oo
ci/prow/format
ci/prow/frontend-checks
ci/prow/images
ci/prow/integration
ci/prow/lint
ci/prow/secret-bootstrapper-validation
ci/prow/secret-generator-validation
ci/prow/unit
ci/prow/validate-test-infra
ci/prow/validate-vendor
pull-ci-openshift-ci-tools-master-breaking-changes
pull-ci-openshift-ci-tools-master-checkconfig
pull-ci-openshift-ci-tools-master-codegen
pull-ci-openshift-ci-tools-master-e2e
pull-ci-openshift-ci-tools-master-e2e-oo
pull-ci-openshift-ci-tools-master-format
pull-ci-openshift-ci-tools-master-frontend-checks
pull-ci-openshift-ci-tools-master-images
pull-ci-openshift-ci-tools-master-integration
pull-ci-openshift-ci-tools-master-lint
pull-ci-openshift-ci-tools-master-secret-bootstrapper-validation
pull-ci-openshift-ci-tools-master-secret-generator-validation
pull-ci-openshift-ci-tools-master-unit
pull-ci-openshift-ci-tools-master-validate-test-infra
pull-ci-openshift-ci-tools-master-validate-vendor
tide

Details

In response to this:

/override e2e-oo

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

petr-muller · 2022-02-09T14:30:07Z

/override ci/prow/e2e-oo

This is not a new failure: I just happened to trip on a perma-broken, conditionally triggered job, which we also run as a postsubmit and ignore (https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/branch-ci-openshift-ci-tools-master-e2e-oo-post).

openshift-ci · 2022-02-09T14:30:39Z

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-oo

openshift-ci · 2022-02-09T14:30:40Z

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.

petr-muller added 3 commits February 9, 2022 13:54

ci-operator: record aggregate error instead last one

85b816f

openshift-ci Bot requested review from a team, bbguimaraes and jupierce February 9, 2022 13:31

openshift-ci Bot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 9, 2022

openshift-ci Bot assigned bbguimaraes Feb 9, 2022

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Feb 9, 2022

openshift-merge-robot merged commit 70c1b23 into openshift:master Feb 9, 2022

petr-muller mentioned this pull request Feb 9, 2022

knative-eventing: lower resource requests openshift/release#26101

Merged

openshift-merge-robot merged 3 commits into openshift:master from petr-muller:robustify-build-retries
