
ci-operator: retry infra-failed builds immediately #2648

Merged
openshift-merge-robot merged 1 commit into openshift:master from petr-muller:retry-builds-in-place on Feb 8, 2022

Conversation

@petr-muller (Member)

ci-operator was already able to recognize infrastructure-failed builds from previous runs and retry them. This is an attempt to reuse that code to retry such failed builds immediately, with two attempts in an exponential backoff. The backoff has an intentionally long starting delay of 1 minute to give the infrastructure problem a chance to go away. The way the code is structured makes it less optimal for the case where we are retrying infra failures from previous executions: it will eat one of the backoff iterations. Such cases should be rare, though, because ci-operator runs should no longer fail on infrastructure failures at all once those are retried immediately.
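For context, here is a minimal sketch of the retry shape described above, assuming apimachinery's wait.ExponentialBackoff; createBuild and isInfraFailure are hypothetical stand-ins for the actual ci-operator helpers, not its real API:

```go
package retry

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// handleBuild retries build creation on infrastructure failures, with a
// deliberately long first delay so the underlying problem has a chance
// to clear. createBuild and isInfraFailure are illustrative stand-ins.
func handleBuild(createBuild func() error, isInfraFailure func(error) bool) error {
	var buildErr error
	attempts := 3
	backoff := wait.Backoff{Duration: 1 * time.Minute, Factor: 2, Steps: attempts}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		buildErr = createBuild()
		if buildErr == nil {
			return true, nil // success: stop retrying
		}
		if isInfraFailure(buildErr) {
			return false, nil // infrastructure flake: retry after the next delay
		}
		return false, buildErr // genuine failure: abort immediately
	})
	if err == wait.ErrWaitTimeout {
		return fmt.Errorf("build not successful after %d attempts, last error: %w", attempts, buildErr)
	}
	return err
}
```

The first attempt happens immediately; wait.ErrWaitTimeout is what ExponentialBackoff returns once the Steps budget is exhausted without success.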

/cc @bbguimaraes @jupierce @openshift/test-platform

@openshift-ci bot requested review from a team, bbguimaraes and jupierce on Feb 7, 2022 19:09
@openshift-ci bot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 7, 2022
@jupierce (Contributor) left a comment

One thought. Otherwise, looks great!

Comment thread: pkg/steps/source.go (Outdated)

```go
	return fmt.Errorf("could not create build %s: %w", build.Name, err)
	var buildErr error
	attempts := 3
	if boErr := wait.ExponentialBackoff(wait.Backoff{Duration: 1 * time.Minute, Factor: 2, Steps: attempts}, func() (bool, error) {
```
@jupierce (Contributor) commented on Feb 7, 2022:

It seems to take about 5 minutes for a new node to come online, so this backoff gives us a chance at one new node arriving in time. Given the low cost of the kubelet rejecting us, I'd consider attempts=5 and factor=1.5, which would give us a few chances for new nodes to join.

@petr-muller (Member, Author) replied:

Sounds good!

A Contributor replied:

I've always found the apimachinery parameters confusing, but note that these values will give us:

```lua
-- simulate the backoff schedule: n attempts, per-attempt delay d,
-- factor f, cumulative time t (in minutes)
n, d, f, t = 5, 1, 1.5, 0
print("i", "t", "d")
for i = 0, n - 1 do
    print(i, t, d)
    t, d = t + d, d * f
end
```

```
i	t	d
0	0	1
1	1	1.5
2	2.5	2.25
3	4.75	3.375
4	8.125	5.0625
```

That gives us a single retry (the last) after a new node arrives at t ≈ 5, unless the combined retries take more than 1.875s (I don't know how long they take and could not find an example).
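The same schedule can be sanity-checked against the library itself. A small standalone program (assuming only k8s.io/apimachinery; the loop count mirrors the simulation above):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Print the nominal delay before each retry for the suggested
	// parameters. Step() returns the current delay and advances the
	// backoff; with no Jitter set, the output is deterministic.
	b := wait.Backoff{Duration: 1 * time.Minute, Factor: 1.5, Steps: 5}
	for i := 1; i <= 5; i++ {
		fmt.Printf("delay before retry %d: %s\n", i, b.Step())
	}
	// Prints 1m0s, 1m30s, 2m15s, 3m22.5s, 5m3.75s: the d column from
	// the simulation above, in minutes.
}
```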

@petr-muller (Member, Author) replied:
I think we do not necessarily need to retry until there is a new node: that's the scheduler's job. We just need to give the scheduler a chance to retry the bad placement that results in OutOfCpu. It is fine if the retried Pod ends up unschedulable; that makes it wait for the node.

@petr-muller force-pushed the retry-builds-in-place branch from efee15c to c056a8c on Feb 7, 2022 21:03
@petr-muller (Member, Author) commented:

/test integration

Comment thread: pkg/steps/source.go

```go
		return false, nil
	}); boErr != nil {
		if boErr == wait.ErrWaitTimeout {
			return fmt.Errorf("build not successful after %d attempts, last error: %w", attempts, buildErr)
```
@jupierce (Contributor) commented:
May be more informative to wrap an aggregate with all errors, instead of just the last one.

@petr-muller (Member, Author) replied:
Good point, will address
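
A sketch of the suggested aggregation, assuming apimachinery's utilerrors package; as in the earlier sketch, createBuild and isInfraFailure are hypothetical stand-ins:

```go
package retry

import (
	"fmt"
	"time"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
	"k8s.io/apimachinery/pkg/util/wait"
)

// handleBuildAggregate collects every attempt's error and wraps them
// all in the final message instead of reporting only the last one.
func handleBuildAggregate(createBuild func() error, isInfraFailure func(error) bool) error {
	attempts := 5
	var attemptErrs []error
	backoff := wait.Backoff{Duration: 1 * time.Minute, Factor: 1.5, Steps: attempts}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		buildErr := createBuild()
		if buildErr == nil {
			return true, nil
		}
		attemptErrs = append(attemptErrs, buildErr) // remember every failure
		if isInfraFailure(buildErr) {
			return false, nil // transient: retry
		}
		return false, buildErr // permanent: abort
	})
	if err == wait.ErrWaitTimeout {
		return fmt.Errorf("build not successful after %d attempts: %w", attempts, utilerrors.NewAggregate(attemptErrs))
	}
	return err
}
```

utilerrors.NewAggregate flattens the slice into a single error whose message joins all attempt errors, so nothing observed along the way is lost.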

@jupierce (Contributor) commented on Feb 8, 2022:

/lgtm

@openshift-ci bot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Feb 8, 2022
@openshift-ci bot commented on Feb 8, 2022:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jupierce, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot commented on Feb 8, 2022:

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@petr-muller (Member, Author) commented:

/hold

@openshift-ci bot added the "do-not-merge/hold" label (Indicates that a PR should not merge because someone has issued a /hold command.) on Feb 8, 2022
@openshift-merge-robot merged commit 543ca36 into openshift:master on Feb 8, 2022
@petr-muller (Member, Author) commented:

🖕 Tide
