ci-operator: retry infra-failed builds immediately#2659
Conversation
`ci-operator` was already able to recognize infrastructure-failed builds from previous runs and retry them. This is an attempt to reuse that code to retry such failed builds immediately, with two attempts in an exponential backoff. The backoff has an intentionally long starting delay of 1 minute to give the infrastructure problem a chance to go away. The way the code is structured makes it less optimal for the case where we are retrying infra failures from the previous executions: it will eat one of the backoff iterations, but such cases should be rare because ci-op runs should not result in failures caused by infrastructure failures anymore (because they are retried immediately).
`Create` populates the `Build`s metadata with resource version, UID etc of the created object, so the same object cannot be subsequently used in a subsequent `Create` when we need to retry. Make local deep copies of the input (desired) `Build` and create them instead of the original. The `Build` is passed into the method as a pointer, which made me wonder whether some callsite actually depends on the side effect on the object, but it does not. To prevent future confusion like that, I changed `handleBuild` to accept an instance copy, not a pointer. If something in the future needs the resulting created `Build` object, the method should return it as a return value.
|
It's very sad that the existing code
But none of those are new or need to block this PR. /lgtm |
|
Maybe someday Go will also advance to 1980s C/++ state of the art and adopt |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bbguimaraes, petr-muller The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test e2e-oo |
|
/override e2e-oo |
|
@petr-muller: /override requires a failed status context or a job name to operate on.
Only the following contexts were expected:
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/override ci/prow/e2e-oo This is not a new failure (I just happened to trip on a perma-broken conditionally triggered job (which we run as a postsubmit and we ignore it: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/branch-ci-openshift-ci-tools-master-e2e-oo-post) |
|
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-oo DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Previous version caused
resourceVersion should not be set on objects to be createderrors and needed to be reverted. This PR addresses that bug in a new commit. I also included the error handling change from #2655.Createpopulates theBuilds metadata with resource version, UID etc of the created object, so the same object cannot be used in a subsequentCreatewhen we need to retry.Make local deep copies of the input (desired)
Buildand create them instead of the original. TheBuildis passed into the method as a pointer, which made me wonder whether some callsite actually depends on the side effect on the object, but it does not. To prevent future confusion like that, I changedhandleBuildto accept an instance copy, not a pointer. If something in the future needs the resulting createdBuildobject, the method should return it as a return value.ci-operator was already able to recognize infrastructure-failed builds
from previous runs and retry them. This is an attempt to reuse that code
to retry such failed builds immediately, with two attempts in an
exponential backoff. The backoff has an intentionally long starting
delay of 1 minute to give the infrastructure problem a chance to go
away. The way the code is structured makes it less optimal for the case
where we are retrying infra failures from the previous executions: it
will eat one of the backoff iterations, but such cases should be rare
because ci-op runs should not result in failures caused by
infrastructure failures anymore (because they are retried immediately).
/cc @openshift/test-platform @bbguimaraes @jupierce
/label tide/merge-method-squash