Skip to content

test/helpers: use RetryOnConflict for node writes#2921

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
cgwalters:e2e-node-race
Feb 10, 2022
Merged

test/helpers: use RetryOnConflict for node writes#2921
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
cgwalters:e2e-node-race

Conversation

@cgwalters
Copy link
Copy Markdown
Member

@cgwalters cgwalters commented Jan 20, 2022

I saw this race in a PR:

Error:      	Expected nil, but got: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"Operation cannot be fulfilled on nodes \"ci-op-pjcj27z5-ff089-5jw6w-worker-b-dhsmk\": the object has been modified; please apply your changes to the latest version and try again", Reason:"Conflict", Details:(*v1.StatusDetails)(0xc000450cc0), Code:409}}
Test:       	TestKernelType

A general problem with our tests is that they tend to be written
in "one shot" mode but in cases like this we need to do retries
and reconciliation.

(Actually in this case it'd be better to format a strategic patch
I think, but for now let's just do a standard RetryOnConflict
as is done elsewhere)

I saw this race in a PR:

```
Error:      	Expected nil, but got: &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"Operation cannot be fulfilled on nodes \"ci-op-pjcj27z5-ff089-5jw6w-worker-b-dhsmk\": the object has been modified; please apply your changes to the latest version and try again", Reason:"Conflict", Details:(*v1.StatusDetails)(0xc000450cc0), Code:409}}
Test:       	TestKernelType
```

A general problem with our tests is that they tend to be written
in "one shot" mode but in cases like this we need to do retries
and reconciliation.

(Actually in this case it'd be better to format a strategic patch
 I think, but for now let's just do a standard `RetryOnConflict`
 as is done elsewhere)
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 20, 2022
@mkenigs
Copy link
Copy Markdown
Contributor

mkenigs commented Jan 20, 2022

Do we want this against mcbs to help #2886 ?

Just cause I don't understand: how is there a race? Are our e2e tests run in parallel?

@cgwalters
Copy link
Copy Markdown
Member Author

how is there a race? Are our e2e tests run in parallel?

The kubelet running on the node may also update the node object. This code isn't using a strategic merge patch or the (linked from there) newer server side apply.

That means any concurrent modification (e.g. from kubelet) will result in a conflict error as we saw.

I don't think our e2e tests here run in parallel, and in practice we probably don't hit this issue often because I think kubelet doesn't touch the node object very frequently.

@cgwalters
Copy link
Copy Markdown
Member Author

Do we want this against mcbs to help #2886 ?

We could, but I suspect just retrying the tests there will work. If we hit the race again let's try pulling this commit there too. Or perhaps better, merge to main/master and do another PR to rebase mcbs.

@mkenigs
Copy link
Copy Markdown
Contributor

mkenigs commented Jan 20, 2022

Gotcha, thanks! Thanks for the fix the failure was looking pretty mysterious to me

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

Nice, this makes sense to me. 👍
/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Feb 1, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 1, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, kikisdeliveryservice, mkenigs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

7 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@cheesesashimi
Copy link
Copy Markdown
Member

I don't think our e2e tests here run in parallel, and in practice we probably don't hit this issue often because I think kubelet doesn't touch the node object very frequently.

I can confirm that our e2e tests do not run in parallel. To my understanding, Golang will run all tests within a given package sequentially (unless one makes use of t.Parallel() to explicitly mark tests that can be run in parallel). However, it will attempt to run tests from all packages in parallel (modulo the -p flag). Since we invoke the e2e tests and have them confined to a package on their own, they are run sequentially.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

6 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

16 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

/skip

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

Since this change only affects our e2e-gcp-op tests (which have long passed), going to override the failing e2e-agnositic-upgrade, since this helper isn't called in that test and wasting retest cycles on that job for this PR seems a bit wasteful. We tried. 😆

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 9, 2022

@kikisdeliveryservice: Overrode contexts on behalf of kikisdeliveryservice: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

Since this change only affects our e2e-gcp-op tests (which have long passed), going to override the failing e2e-agnositic-upgrade, since this helper isn't called in that test and wasting retest cycles on that job for this PR seems a bit wasteful. We tried. 😆

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 10, 2022

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-upgrade-single-node 0120107 link false /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-disruptive 0120107 link false /test e2e-aws-disruptive
ci/prow/e2e-vsphere-upgrade 0120107 link false /test e2e-vsphere-upgrade
ci/prow/okd-e2e-aws 0120107 link false /test okd-e2e-aws
ci/prow/e2e-aws-single-node 0120107 link false /test e2e-aws-single-node
ci/prow/e2e-aws-workers-rhel7 0120107 link false /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-workers-rhel8 0120107 link false /test e2e-aws-workers-rhel8

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

ughh why is this retesting if i overrode??

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 10, 2022

@kikisdeliveryservice: Overrode contexts on behalf of kikisdeliveryservice: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

ughh why is this retesting if i overrode??

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 10, 2022

@kikisdeliveryservice: Overrode contexts on behalf of kikisdeliveryservice: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit c8206b9 into openshift:master Feb 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants