ci-operator: retry infra-failed builds immediately#2659

Merged
openshift-merge-robot merged 3 commits into openshift:master from petr-muller:robustify-build-retries
Feb 9, 2022
@petr-muller
Member

@petr-muller petr-muller commented Feb 9, 2022

The previous version of this change caused `resourceVersion should not be set on objects to be created` errors and had to be reverted. This PR fixes that bug in a new commit and also includes the error handling change from #2655.

`Create` populates the `Build`'s metadata with the resource version, UID, etc. of the created object, so the same object cannot be passed to a subsequent `Create` when we need to retry.

The fix is to make a local deep copy of the input (desired) `Build` for each attempt and create the copy instead of the original. The `Build` was passed into the method as a pointer, which made me wonder whether some call site actually depended on the side effect on the object, but none does. To prevent that kind of confusion in the future, I changed `handleBuild` to accept a `Build` value instead of a pointer. If something later needs the created `Build` object, the method should return it.


ci-operator was already able to recognize infrastructure-failed builds
from previous runs and retry them. This is an attempt to reuse that code
to retry such failed builds immediately, with two attempts in an
exponential backoff. The backoff has an intentionally long starting
delay of 1 minute to give the infrastructure problem a chance to go
away. The way the code is structured makes it suboptimal when retrying
infra failures left over from previous executions: that case consumes
one of the backoff iterations. Such cases should be rare, though,
because ci-operator runs should no longer end in builds failed by
infrastructure problems, now that those are retried immediately.

/cc @openshift/test-platform @bbguimaraes @jupierce
/label tide/merge-method-squash

@openshift-ci openshift-ci Bot requested review from a team, bbguimaraes and jupierce February 9, 2022 13:31
@openshift-ci openshift-ci Bot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 9, 2022
@bbguimaraes
Contributor

It's very sad that the existing code

  • discards intermediate results
  • uses polling
  • has no tests

But none of those are new or need to block this PR.

/lgtm

@bbguimaraes
Contributor

Maybe someday Go will also advance to 1980s C/++ state of the art and adopt const pointers.

@openshift-ci
Contributor

openshift-ci Bot commented Feb 9, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, petr-muller

Needs approval from an approver in each of these files:
  • OWNERS [bbguimaraes,petr-muller]


@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Feb 9, 2022
@petr-muller
Member Author

/test e2e-oo

@petr-muller
Member Author

/override e2e-oo

@openshift-ci
Contributor

openshift-ci Bot commented Feb 9, 2022

@petr-muller: /override requires a failed status context or a job name to operate on.
The following unknown contexts were given:

  • e2e-oo

Only the following contexts were expected:

  • ci/prow/breaking-changes
  • ci/prow/checkconfig
  • ci/prow/codegen
  • ci/prow/e2e
  • ci/prow/e2e-oo
  • ci/prow/format
  • ci/prow/frontend-checks
  • ci/prow/images
  • ci/prow/integration
  • ci/prow/lint
  • ci/prow/secret-bootstrapper-validation
  • ci/prow/secret-generator-validation
  • ci/prow/unit
  • ci/prow/validate-test-infra
  • ci/prow/validate-vendor
  • pull-ci-openshift-ci-tools-master-breaking-changes
  • pull-ci-openshift-ci-tools-master-checkconfig
  • pull-ci-openshift-ci-tools-master-codegen
  • pull-ci-openshift-ci-tools-master-e2e
  • pull-ci-openshift-ci-tools-master-e2e-oo
  • pull-ci-openshift-ci-tools-master-format
  • pull-ci-openshift-ci-tools-master-frontend-checks
  • pull-ci-openshift-ci-tools-master-images
  • pull-ci-openshift-ci-tools-master-integration
  • pull-ci-openshift-ci-tools-master-lint
  • pull-ci-openshift-ci-tools-master-secret-bootstrapper-validation
  • pull-ci-openshift-ci-tools-master-secret-generator-validation
  • pull-ci-openshift-ci-tools-master-unit
  • pull-ci-openshift-ci-tools-master-validate-test-infra
  • pull-ci-openshift-ci-tools-master-validate-vendor
  • tide

@petr-muller
Member Author

/override ci/prow/e2e-oo

This is not a new failure: I just happened to trip on a perma-broken, conditionally triggered job, which we run as a postsubmit and ignore: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/branch-ci-openshift-ci-tools-master-e2e-oo-post

@openshift-ci
Contributor

openshift-ci Bot commented Feb 9, 2022

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-oo


@openshift-ci
Contributor

openshift-ci Bot commented Feb 9, 2022

@petr-muller: all tests passed!

