Conversation

@deads2k (Contributor) commented Nov 2, 2022

This eliminates flexibility in how tests are started, inlines anonymous functions, and attempts to build a pipeline of data-in to data-out.

It got pretty big and I haven't run it locally yet. I still need to prove that the correct number of tests run and that certain tests run serially.

TBH, I don't know if it's easier to review as a diff or as new code for running tests.

@openshift-ci bot added the do-not-merge/work-in-progress label Nov 2, 2022
@openshift-ci bot requested review from csrwng and spadgett November 2, 2022 20:04
@openshift-ci bot added the approved label Nov 2, 2022
@deads2k force-pushed the refactor-launch-01 branch from b0c07b9 to a4383d6 November 2, 2022 23:19
@deads2k changed the title from "[wip] refactor how tests are run" to "refactor how tests are run" Nov 3, 2022
@openshift-ci bot removed the do-not-merge/work-in-progress label Nov 3, 2022
@deads2k (Contributor, Author) commented Nov 3, 2022

The test numbers aren't matching up exactly, but they are close. I'll have to open a no-op PR on the same base and write something to compare which tests are missing. But I think we're ready for a review.

@deads2k mentioned this pull request Nov 3, 2022
@deads2k (Contributor, Author) commented Nov 3, 2022

/test all

@deads2k (Contributor, Author) commented Nov 3, 2022

/hold

until we verify against #27519

@openshift-ci bot added the do-not-merge/hold label Nov 3, 2022
Comment on lines 356 to 341
- // Run kube, storage, openshift, and must-gather tests. If user specified a count of -1,
+ // RunTestInNewProcess kube, storage, openshift, and must-gather tests. If user specified a count of -1,
Member:

Should this have been renamed?

@deads2k (Contributor, Author):

> Should this have been renamed?

No, I should fix.

Comment on lines 139 to 140
// could be running at the same time. While these are technically [Serial], ginkgo
// parallel mode provides this guarantee. Doing this for all suites would be too
@stbenjam (Member) Nov 3, 2022:

I understand we're using this to mark some vendored tests from certain packages as always serial, but I don't get this comment:

> While these are technically [Serial], ginkgo parallel mode provides this guarantee.

What guarantee?

@deads2k (Contributor, Author):

> I don't get this comment, it's very confusing.
> While these are technically [Serial], ginkgo parallel mode provides this guarantee.
> What guarantee?

TBH, I don't actually know. It came from Clayton and I preserved it.

@deads2k (Contributor, Author):

Continuing this thread: according to Maciej, we don't even need this block, because it has been broken for four releases.
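
For context, roughly what a block like this typically does is tag tests from certain vendored packages with [Serial] so the scheduler never runs them concurrently. A hedged sketch only; the helper name, package list, and signature are assumptions, not the PR's actual code:

import "strings"

// Hypothetical: vendored packages whose tests must never run
// concurrently with anything else (the entries are assumptions).
var requireSerialPackages = []string{
	"k8s.io/kubernetes/test/e2e/storage",
}

// maybeMarkSerial appends the [Serial] tag to a test name when its
// code location falls inside one of the packages above, so the
// scheduler routes it to the serial phase.
func maybeMarkSerial(name, codeLocation string) string {
	for _, pkg := range requireSerialPackages {
		if strings.Contains(codeLocation, pkg) && !strings.Contains(name, "[Serial]") {
			return name + " [Serial]"
		}
	}
	return name
}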

@stbenjam (Member) commented Nov 3, 2022

/retest-required

CI jobs didn't get scheduled.

@deads2k (Contributor, Author) commented Nov 3, 2022

/retest

Comment on lines +285 to +286
testOutputLock := &sync.Mutex{}
testOutputConfig := newTestOutputConfig(testOutputLock, opt.Out, monitorEventRecorder, includeSuccess)
Member:

Now that there's a real lock here, do we know yet how much, if any, this slows things down?

@deads2k (Contributor, Author) Nov 3, 2022:

> Now that there's a real lock here, do we know yet how much, if any, this slows things down?

Looks like at most 10 minutes on parallel runs. Most runs appear to take about the same time.
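
For readers following along, a minimal sketch of a lock-guarded output sink like the one being discussed; newTestOutputConfig appears in the diff, but the struct fields and Write method below are assumptions:

import (
	"io"
	"sync"
)

// testOutputConfig funnels output from concurrently running tests
// through a single writer (field names are assumptions).
type testOutputConfig struct {
	lock *sync.Mutex
	out  io.Writer
}

func (c *testOutputConfig) Write(p []byte) (int, error) {
	// Hold the lock across the whole write so output from parallel
	// tests never interleaves mid-record. Contention on this lock is
	// the potential slowdown being asked about.
	c.lock.Lock()
	defer c.lock.Unlock()
	return c.out.Write(p)
}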

@deads2k (Contributor, Author) commented Nov 3, 2022

The counts are checking out OK to me.

/hold cancel

@openshift-ci bot removed the do-not-merge/hold label Nov 3, 2022
@deads2k (Contributor, Author) commented Nov 3, 2022

The following tests are no longer run:

: [bz-[sig-apps][Feature:OpenShiftControllerManager]] clusteroperator/[sig-apps][Feature:OpenShiftControllerManager] should not change condition/Available
: [bz-[sig-apps][Feature:OpenShiftControllerManager]] clusteroperator/[sig-apps][Feature:OpenShiftControllerManager] should not change condition/Degraded
: [bz-[sig-network][Feature:EgressFirewall]] clusteroperator/[sig-network][Feature:EgressFirewall] should not change condition/Available
: [bz-[sig-network][Feature:EgressFirewall]] clusteroperator/[sig-network][Feature:EgressFirewall] should not change condition/Degraded
: [bz-[sig-network][Feature:Network] clusteroperator/[sig-network][Feature:Network should not change condition/Available
: [bz-[sig-network][Feature:Network] clusteroperator/[sig-network][Feature:Network should not change condition/Degraded
: [bz-[sig-scheduling][Early]] clusteroperator/[sig-scheduling][Early] should not change condition/Available
: [bz-[sig-scheduling][Early]] clusteroperator/[sig-scheduling][Early] should not change condition/Degraded

}

timeout := opt.Timeout
if timeout == 0 {
@kikisdeliveryservice Nov 3, 2022:

Can someone double-check my understanding of this double timeout == 0 check? It seems like the logic is essentially:

if opt.Timeout == 0:
    if suite.TestTimeout == 0:
        timeout = 15 * time.Minute
    else:
        timeout = suite.TestTimeout
else:
    timeout = opt.Timeout

Member:

I think so.

We're picking the first non-zero value from this:

  • opt.Timeout
  • suite.TestTimeout
  • 15 minutes
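
In Go that selection is just two zero-checks in a row — a sketch, where the first two lines match the quoted snippet and the rest is assumed to mirror it:

timeout := opt.Timeout
if timeout == 0 {
	timeout = suite.TestTimeout
}
if timeout == 0 {
	timeout = 15 * time.Minute
}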

r = r.Next()
}
q.queue = r
remainingParallelTests := make(chan *testCase, 100)
Contributor:

My assumption is that 100 here is the max number of tests that will run in parallel -- do I have that right?

Member:

You could run any number of tests in parallel. This is replacing the ring list it used earlier. The channel holds up to 100 test cases, but it is constantly re-fed by the goroutine below (queueAllTests). Then, if our parallelism is, say, 30, the for loop below launches 30 goroutines that each consume test cases from the channel.

Contributor:

Ah, so I thought the for loop below was launching n (n = parallelism) tests at a time, but actually it's launching n goroutines that each pull one test at a time from the channel, which results in running up to n tests in parallel.
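
To make the shape concrete, here is a minimal sketch of that producer/consumer pattern; testCase and the queueAllTests role come from the diff, while the function name and everything else are assumptions:

import "sync"

type testCase struct{ name string } // stand-in for the real type

func runParallel(tests []*testCase, parallelism int, run func(*testCase)) {
	// Buffered channel holding up to 100 queued tests; the producer
	// goroutine keeps it fed, mirroring queueAllTests in the PR.
	remaining := make(chan *testCase, 100)
	go func() {
		for _, t := range tests {
			remaining <- t
		}
		close(remaining)
	}()

	// parallelism workers each pull one test at a time, so at most
	// that many tests run concurrently.
	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range remaining {
				run(t)
			}
		}()
	}
	wg.Wait()
}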

@stbenjam (Member) commented Nov 4, 2022

/lgtm
/retest-required

@openshift-ci bot added the lgtm label Nov 4, 2022
@openshift-ci bot commented Nov 4, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot commented Nov 4, 2022

@deads2k: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-ovn-rt-upgrade 6482d6f link false /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-metal-ipi-ovn-ipv6 6482d6f link false /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-aws-ovn-single-node-upgrade 6482d6f link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-gcp-csi 6482d6f link false /test e2e-gcp-csi
ci/prow/e2e-agnostic-ovn-cmd 6482d6f link false /test e2e-agnostic-ovn-cmd
ci/prow/e2e-openstack-ovn 6482d6f link false /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-single-node-serial 6482d6f link false /test e2e-aws-ovn-single-node-serial

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot merged commit 97b36df into openshift:master Nov 4, 2022
dgoodwin added a commit to dgoodwin/origin that referenced this pull request Nov 15, 2022
In PR openshift#27516 we suspect reporting of flakes broke due to a missed
assumption that test.flake accompanied test.success. Our new goal is to
more clearly have just one status set, so we're going to lean into the
new approach and properly break out the flake state into its own case.
dgoodwin added a commit to dgoodwin/origin that referenced this pull request Nov 16, 2022
In PR openshift#27516 we suspect reporting of flakes broke due to a missed
assumption that test.flake accompanied test.success. Our new goal is to
more clearly have just one status set, so we're going to lean into the
new approach and properly break out the flake state into its own case.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/origin that referenced this pull request Nov 17, 2022
In PR openshift#27516 we suspect reporting of flakes broke due to a missed
assumption that test.flake accompanied test.success. Our new goal is to
more clearly have just one status set, so we're going to lean into the
new approach and properly break out the flake state into its own case.
tjungblu pushed a commit to tjungblu/origin that referenced this pull request Apr 11, 2023
In PR openshift#27516 we suspect reporting of flakes broke due to a missed
assumption that test.flake accompanied test.success. Our new goal is to
more clearly have just one status set, so we're going to lean into the
new approach and properly break out the flake state into its own case.