refactor how tests are run #27516
Conversation
Force-pushed from b0c07b9 to a4383d6.
The test numbers aren't matching up exactly, but they are close. I'll have to open a no-op PR on the same base here and write something to compare which tests are missing. But I think we're ready for a review.

/test all

/hold until we verify against #27519
pkg/test/ginkgo/cmd_runsuite.go (Outdated)
// Run kube, storage, openshift, and must-gather tests. If user specified a count of -1,
// RunTestInNewProcess kube, storage, openshift, and must-gather tests. If user specified a count of -1,
Should this have been renamed?
> Should this have been renamed?
No, I should fix.
pkg/test/ginkgo/queue.go (Outdated)
// could be running at the same time. While these are technically [Serial], ginkgo
// parallel mode provides this guarantee. Doing this for all suites would be too
I understand we're using this to mark some vendored tests from certain packages as always serial, but I don't get this comment:

> While these are technically [Serial], ginkgo parallel mode provides this guarantee.

What guarantee?
> I don't get this comment, it's very confusing.
> While these are technically [Serial], ginkgo parallel mode provides this guarantee.
> What guarantee?
TBH, I don't actually know. It came from clayton and I preserved it.
Continuing this thread, according to maciej, we don't even need this block because it has been broken for four releases.
/retest-required
CI jobs didn't get scheduled

/retest
testOutputLock := &sync.Mutex{}
testOutputConfig := newTestOutputConfig(testOutputLock, opt.Out, monitorEventRecorder, includeSuccess)
Now that there's a real lock here, do we know yet how much, if any, this slows things down?
> Now that there's a real lock here, do we know yet how much, if any, this slows things down?
Looks like at most 10 minutes on parallel runs. Most runs appear to be about the same.
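For readers following along, here is a minimal sketch of the lock-per-write pattern being discussed. The `lockedOutput` type and its method are assumptions for illustration only; the real `testOutputConfig` has different fields and helpers.

```go
package ginkgo

import (
	"io"
	"sync"
)

// lockedOutput is a hypothetical stand-in for the real output config.
type lockedOutput struct {
	lock *sync.Mutex
	out  io.Writer
}

func (l *lockedOutput) Write(p []byte) (int, error) {
	// Every test goroutine takes the same mutex before writing, so output
	// from parallel tests cannot interleave; the cost is brief contention
	// on each write, which the timings above suggest is small.
	l.lock.Lock()
	defer l.lock.Unlock()
	return l.out.Write(p)
}
```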
The counts are checking out OK to me.
/hold cancel

The following tests are no longer run: [bz-[sig-apps][Feature:OpenShiftControllerManager]] clusteroperator/[sig-apps][Feature:OpenShiftControllerManager] should not change condition/Available
}

timeout := opt.Timeout
if timeout == 0 {
Can someone double-check my understanding of this double timeout == 0 check? It seems like the logic is essentially:

if opt.Timeout == 0:
    if suite.TestTimeout == 0:
        timeout = 15*time.Minute
    else:
        timeout = suite.TestTimeout
else:
    timeout = opt.Timeout
I think so.
We're picking the first non-zero value from this:
- opt.Timeout
- suite.TestTimeout
- 15 minutes
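That fallback can be restated as a small Go sketch. The helper function below is invented for illustration; only the field names `opt.Timeout` and `suite.TestTimeout` and the 15-minute default come from the discussion above.

```go
package ginkgo

import "time"

// effectiveTimeout is a hypothetical helper restating the fallback:
// opt.Timeout wins if set, then suite.TestTimeout, then a 15-minute default.
func effectiveTimeout(optTimeout, suiteTestTimeout time.Duration) time.Duration {
	timeout := optTimeout
	if timeout == 0 {
		timeout = suiteTestTimeout
	}
	if timeout == 0 {
		timeout = 15 * time.Minute
	}
	return timeout
}
```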
r = r.Next()
}
q.queue = r
remainingParallelTests := make(chan *testCase, 100)
My assumption is 100 here is the max number of tests that will run in parallel -- do I have that right?
You could run any number of tests in parallel. This is replacing the ring list it used earlier. The channel holds up to 100 test cases, but it constantly gets refilled by the goroutine below (queueAllTests). Then if our parallelism is, say, 30, it launches 30 goroutines in the for loop below that each consume test cases from the channel.
Ah, so I thought the for loop below was launching n (n = parallelism) parallel tests at a time, but actually it launches n goroutines that each pull one test at a time from the channel, which results in running up to n tests in parallel.
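For reference, a hedged sketch of that producer/worker pattern. The `testCase` type and `runParallel` function below are stand-ins, not the PR's actual code; the point is that the 100-slot buffer only bounds how many queued tests sit in the channel, while `parallelism` bounds how many execute at once.

```go
package ginkgo

import "sync"

// testCase stands in for the real type; runParallel is illustrative only.
type testCase struct {
	name string
}

func runParallel(tests []*testCase, parallelism int, run func(*testCase)) {
	remaining := make(chan *testCase, 100)

	// Producer goroutine keeps the channel fed (analogous to queueAllTests),
	// then closes it so the workers can drain and exit.
	go func() {
		for _, t := range tests {
			remaining <- t
		}
		close(remaining)
	}()

	// Each worker pulls one test at a time, so at most `parallelism`
	// tests run concurrently.
	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range remaining {
				run(t)
			}
		}()
	}
	wg.Wait()
}
```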
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, stbenjam

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@deads2k: The following tests failed, say

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
In PR openshift#27516 we suspect reporting of flakes broke due to a missed assumption that test.flake accompanied test.success. Our new goal is to more clearly have just one status set, so we're going to lean into the new approach and properly break out the flake state into its own case.
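To make the "one status per test" idea concrete, here is a hedged sketch; the status type, constants, and reporting function are invented for illustration and do not reflect how the repository actually models test state.

```go
package ginkgo

import "fmt"

// testStatus and the constants below are illustrative assumptions only.
type testStatus int

const (
	statusSuccess testStatus = iota
	statusFlake // previously signalled by setting flake alongside success
	statusFailure
)

// reportStatus shows the intent of the change: flakes get their own case
// instead of piggybacking on the success path.
func reportStatus(name string, s testStatus) {
	switch s {
	case statusSuccess:
		fmt.Printf("passed: %s\n", name)
	case statusFlake:
		fmt.Printf("flaked: %s\n", name)
	case statusFailure:
		fmt.Printf("failed: %s\n", name)
	}
}
```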
This eliminates flexibility in how tests are started, inlines anonymous functions, and attempts to build a pipeline of data-in to data-out.
It got pretty big and I haven't run it locally yet. I still need to prove that the correct number of tests run and that certain tests are run serially.
TBH, I don't know if it's easier to review as a diff or easier to review as new code for running tests.