
ci-operator: retry infra-failed builds immediately #2648

Merged
openshift-merge-robot merged 1 commit into openshift:master from petr-muller:retry-builds-in-place on Feb 8, 2022

Conversation

@petr-muller (Member)

ci-operator was already able to recognize infrastructure-failed builds from previous runs and retry them. This is an attempt to reuse that code to retry such failed builds immediately, with two attempts in an exponential backoff. The backoff has an intentionally long starting delay of 1 minute to give the infrastructure problem a chance to go away. The way the code is structured makes it less optimal for the case where we are retrying infra failures from previous executions: it will eat one of the backoff iterations. Such cases should be rare, though, because ci-operator runs should no longer fail on infrastructure failures at all once those are retried immediately.
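For context, here is a minimal sketch of the retry shape described above, assuming apimachinery's wait.ExponentialBackoff; createBuild and isInfraFailure are hypothetical stand-ins for the actual ci-operator helpers, not its real API:

```go
package retry

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// handleBuild retries build creation on infrastructure failures, with a
// deliberately long first delay so the underlying problem has a chance
// to clear. createBuild and isInfraFailure are illustrative stand-ins.
func handleBuild(createBuild func() error, isInfraFailure func(error) bool) error {
	var buildErr error
	attempts := 3
	backoff := wait.Backoff{Duration: 1 * time.Minute, Factor: 2, Steps: attempts}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		buildErr = createBuild()
		if buildErr == nil {
			return true, nil // success: stop retrying
		}
		if isInfraFailure(buildErr) {
			return false, nil // infrastructure flake: retry after the next delay
		}
		return false, buildErr // genuine failure: abort immediately
	})
	if err == wait.ErrWaitTimeout {
		return fmt.Errorf("build not successful after %d attempts, last error: %w", attempts, buildErr)
	}
	return err
}
```

The first attempt happens immediately; wait.ErrWaitTimeout is what ExponentialBackoff returns once the Steps budget is exhausted without success.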

/cc @bbguimaraes @jupierce @openshift/test-platform

@openshift-ci bot requested review from a team, bbguimaraes and jupierce on Feb 7, 2022 19:09
@openshift-ci bot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 7, 2022
@jupierce (Contributor) left a comment

One thought. Otherwise, looks great!

Comment thread: pkg/steps/source.go (Outdated)

```go
	return fmt.Errorf("could not create build %s: %w", build.Name, err)
	var buildErr error
	attempts := 3
	if boErr := wait.ExponentialBackoff(wait.Backoff{Duration: 1 * time.Minute, Factor: 2, Steps: attempts}, func() (bool, error) {
```
@jupierce (Contributor) commented on Feb 7, 2022:

It seems to take about 5 minutes for a new node to come online, so this backoff gives us a chance at one new node arriving in time. Given the low cost of the kubelet rejecting us, I'd consider attempts=5 and factor=1.5, which would give us a few chances for new nodes to join.

@petr-muller (Member, Author) replied:

Sounds good!

A Contributor replied:

I've always found the apimachinery parameters confusing, but note that these values will give us:

```lua
-- simulate the backoff schedule: n attempts, per-attempt delay d,
-- factor f, cumulative time t (in minutes)
n, d, f, t = 5, 1, 1.5, 0
print("i", "t", "d")
for i = 0, n - 1 do
    print(i, t, d)
    t, d = t + d, d * f
end
```

```
i	t	d
0	0	1
1	1	1.5
2	2.5	2.25
3	4.75	3.375
4	8.125	5.0625
```

That gives us a single retry (the last) after a new node arrives at t ≈ 5, unless the combined retries take more than 1.875s (I don't know how long they take and could not find an example).
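The same schedule can be sanity-checked against the library itself. A small standalone program (assuming only k8s.io/apimachinery; the loop count mirrors the simulation above):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Print the nominal delay before each retry for the suggested
	// parameters. Step() returns the current delay and advances the
	// backoff; with no Jitter set, the output is deterministic.
	b := wait.Backoff{Duration: 1 * time.Minute, Factor: 1.5, Steps: 5}
	for i := 1; i <= 5; i++ {
		fmt.Printf("delay before retry %d: %s\n", i, b.Step())
	}
	// Prints 1m0s, 1m30s, 2m15s, 3m22.5s, 5m3.75s: the d column from
	// the simulation above, in minutes.
}
```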

@petr-muller (Member, Author) replied:
I think we do not necessarily need to retry until there is a new node: that's the scheduler's job. We just need to give the scheduler a chance to retry the bad placement that results in OutOfCpu. It is fine if the retried Pod ends up unschedulable; that makes it wait for the node.

@petr-muller force-pushed the retry-builds-in-place branch from efee15c to c056a8c on Feb 7, 2022 21:03
@petr-muller (Member, Author) commented:

/test integration

Comment thread: pkg/steps/source.go

```go
		return false, nil
	}); boErr != nil {
		if boErr == wait.ErrWaitTimeout {
			return fmt.Errorf("build not successful after %d attempts, last error: %w", attempts, buildErr)
```
@jupierce (Contributor) commented:
May be more informative to wrap an aggregate with all errors, instead of just the last one.

@petr-muller (Member, Author) replied:
Good point, will address
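
A sketch of the suggested aggregation, assuming apimachinery's utilerrors package; as in the earlier sketch, createBuild and isInfraFailure are hypothetical stand-ins:

```go
package retry

import (
	"fmt"
	"time"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
	"k8s.io/apimachinery/pkg/util/wait"
)

// handleBuildAggregate collects every attempt's error and wraps them
// all in the final message instead of reporting only the last one.
func handleBuildAggregate(createBuild func() error, isInfraFailure func(error) bool) error {
	attempts := 5
	var attemptErrs []error
	backoff := wait.Backoff{Duration: 1 * time.Minute, Factor: 1.5, Steps: attempts}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		buildErr := createBuild()
		if buildErr == nil {
			return true, nil
		}
		attemptErrs = append(attemptErrs, buildErr) // remember every failure
		if isInfraFailure(buildErr) {
			return false, nil // transient: retry
		}
		return false, buildErr // permanent: abort
	})
	if err == wait.ErrWaitTimeout {
		return fmt.Errorf("build not successful after %d attempts: %w", attempts, utilerrors.NewAggregate(attemptErrs))
	}
	return err
}
```

utilerrors.NewAggregate flattens the slice into a single error whose message joins all attempt errors, so nothing observed along the way is lost.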

@jupierce (Contributor) commented on Feb 8, 2022:

/lgtm

@openshift-ci bot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Feb 8, 2022
@openshift-ci bot commented on Feb 8, 2022:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jupierce, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot commented on Feb 8, 2022:

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@petr-muller (Member, Author) commented:

/hold

@openshift-ci bot added the "do-not-merge/hold" label (Indicates that a PR should not merge because someone has issued a /hold command.) on Feb 8, 2022
@openshift-merge-robot merged commit 543ca36 into openshift:master on Feb 8, 2022
@petr-muller (Member, Author) commented:

🖕 Tide
