Update etcd scaling test for CPMS supported platforms #27497
Conversation
Force-pushed: 8a609d9 → 89498f9
/test e2e-aws-ovn-etcd-scaling
@hasbro17: The specified target(s) for …

The following commands are available to trigger optional jobs: …

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-aws-ovn-etcd-scaling

/test e2e-gcp-ovn-etcd-scaling

/test e2e-gcp-ovn-etcd-scaling
Force-pushed: 89498f9 → 1e20d7d
/hold
Force-pushed: 1e20d7d → 0345bcd
```go
// step 2: wait until the CPMSO scales-up by creating a new machine
// We need to check the cpms' status.readyReplicas because the phase of one machine will always be Deleting
// so we can't use EnsureMasterMachinesAndCount() since that looks for non-Deleting machines
err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 4, cpmsClient)
```
Seems like we should watch for replicas instead of readyReplicas. The latter may never surge to 4 in this case as the deleted machine's member can get removed before we scale up.
Member removed at 19:19:21:

```
2022-12-15T19:19:21.221655410Z I1215 19:19:21.221269 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"16340e00-06fa-4526-9dc4-3da772936f35", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberRemove' removed member with ID: 2485787822400865737
2022-12-15T19:19:21.221655410Z I1215 19:19:21.221529 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"16340e00-06fa-4526-9dc4-3da772936f35", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' successfully removed member: [ url: 10.0.182.231, name: ip-10-0-182-231.ec2.internal, id: 2485787822400865737 ] from the cluster
```

Member added as learner at 19:24:50:

```
2022-12-15T19:24:50.597610435Z I1215 19:24:50.597552 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"16340e00-06fa-4526-9dc4-3da772936f35", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberAddAsLearner' successfully added new member https://10.0.152.48:2380
```
Looking back at the clustermemberremoval controller, and in particular the PR (openshift/cluster-etcd-operator#947) that relaxed the scale-down constraint for unhealthy members, I think we may have inadvertently relaxed this for a healthy member too. Need to confirm.
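For reference, a minimal sketch of what a replicas-based check could look like, assuming a dynamic client scoped to the CPMS resource (the helper name EnsureReplicasOnCPMS and the plumbing are illustrative, not the actual library code):

```go
package scalingtestinglibrary

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/dynamic"
)

// EnsureReplicasOnCPMS polls the "cluster" ControlPlaneMachineSet until
// status.replicas reaches the expected count. Unlike readyReplicas,
// replicas counts the surge machine as soon as it is created, so the
// 3 -> 4 transition is observable even if the deleted machine's member
// is removed before the new machine becomes ready.
func EnsureReplicasOnCPMS(ctx context.Context, expected int64, cpmsClient dynamic.ResourceInterface) error {
	return wait.PollImmediateWithContext(ctx, 10*time.Second, 10*time.Minute, func(ctx context.Context) (bool, error) {
		cpms, err := cpmsClient.Get(ctx, "cluster", metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors and retry
		}
		replicas, found, err := unstructured.NestedInt64(cpms.Object, "status", "replicas")
		if err != nil || !found {
			return false, nil
		}
		return replicas == expected, nil
	})
}
```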
Waiting for the fix that disallows scale-down/removal of the machine pending deletion before the new one is added: openshift/cluster-etcd-operator#978
Force-pushed: 0345bcd → 610dde7
Force-pushed: 610dde7 → 222ce3d
Looks like the node never came up for the Azure run. Retesting:
Okay, so this finally works. For CPMS-supported platforms (AWS and Azure currently) the test relies on the CPMSO to create the machine, while we revert to the manual deletion-then-creation workflow for unsupported platforms (GCP, vSphere). The scaling test itself is passing on Azure but tripping up on a familiar (albeit unresolved) issue. This should be good to review now.
Retesting, but will eventually override the Azure test if it keeps failing on the known pathological-events case.

/test e2e-azure-ovn-etcd-scaling
/retest-required
```go
if cpmsSupported {
```

```go
// TODO: Add cleanup step to recover back to 3 running machines and members if the test fails
```
do we still have serial jobs where this could become an issue?
Not currently, no. We may have other scaling-related tests that have to run serially in the future, but I'd prefer to leave that as a TODO for when we actually have those.
```go
err = errors.Wrap(err, "pre-test: failed to determine if ControlPlaneMachineSet is present")
o.Expect(err).ToNot(o.HaveOccurred())

if cpmsSupported {
```
Do you think we can move that whole if code block into a dedicated test function, and have the setup code above as an initializer for both?
I could move the setup code inside BeforeEach() so we do the common setup before each It() spec block.
In that case I would have two It() spec blocks, both of which would always run, except one would always get skipped based on the conditional. Just to illustrate what I'm thinking:
```go
g.Describe("[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling] etcd [apigroup:config.openshift.io]", func() {
	var cpmsActive bool
	// Plus all the other common setup variables to pass to the It() closures below

	g.BeforeEach(func() {
		// Common setup and compute cpmsActive (assign, don't shadow with :=)
		cpmsActive = ...
	})

	g.It("is able to vertically scale up and down with a single node when CPMS is active [Timeout:60m][apigroup:machine.openshift.io]", func() {
		if !cpmsActive {
			g.Skip("CPMS is inactive so we cannot use the CPMS-based workflow")
		}
		// New CPMS-based workflow
	})

	g.It("is able to vertically scale up and down with a single node when CPMS is inactive [Timeout:60m][apigroup:machine.openshift.io]", func() {
		if cpmsActive {
			g.Skip("CPMS is active so we cannot manually create machines")
		}
		// Old workflow
	})
})
```

Grammatically this is a bit awkward, and I'm not sure the conditional is too clear here.
I'm no Ginkgo expert, but I would think something like the following reads better with a When or Describe block (the two are equivalent):
```go
g.When("CPMS is active", func() {
	g.It("is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io]", func() {
		...
	})
})

g.When("CPMS is inactive", func() {
	g.It("is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io]", func() {
		...
	})
})
```

But again, I'm not sure how to conditionally run either When() block here, which is why I used It() in the former example. But let me know if that's what you meant.
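For what it's worth, one possible way to gate each When() container, sketched under the assumption that cpmsActive is computed in a shared outer BeforeEach: Ginkgo allows Skip() inside a nested BeforeEach, which skips every spec in that container.

```go
g.When("CPMS is active", func() {
	g.BeforeEach(func() {
		// Skip every spec in this container on platforms without an active CPMS.
		if !cpmsActive {
			g.Skip("CPMS is inactive on this platform")
		}
	})
	g.It("is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io]", func() {
		// New CPMS-based workflow
	})
})

g.When("CPMS is inactive", func() {
	g.BeforeEach(func() {
		// Inverse gate: skip when the CPMSO would fight manual machine creation.
		if cpmsActive {
			g.Skip("CPMS is active so the test cannot manually create machines")
		}
	})
	g.It("is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io]", func() {
		// Old manual deletion-then-creation workflow
	})
})
```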
I was thinking along the lines of

```go
scalingtestinglibrary.SkipIfUnsupportedPlatform(ctx, oc)
```

so your first description fits best. I just want to save some refactoring steps for ETCD-330, where I would also need to detect whether CPMS is active or not.
The g.When() indeed reads better, but if it's taking too much time for you to figure it out you can leave this for me :)
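A sketch of what such a helper's body could look like; the Infrastructure lookup and the exact signature are assumptions, not the real implementation:

```go
import (
	"context"
	"fmt"

	g "github.com/onsi/ginkgo"
	o "github.com/onsi/gomega"

	configv1 "github.com/openshift/api/config/v1"
	exutil "github.com/openshift/origin/test/extended/util"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SkipIfUnsupportedPlatform skips the current spec unless the cluster runs
// on a platform where the CPMSO actively manages control-plane machines
// (AWS and Azure at the time of this PR).
func SkipIfUnsupportedPlatform(ctx context.Context, oc *exutil.CLI) {
	infra, err := oc.AdminConfigClient().ConfigV1().Infrastructures().Get(ctx, "cluster", metav1.GetOptions{})
	o.Expect(err).ToNot(o.HaveOccurred())

	platformType := configv1.NonePlatformType
	if infra.Status.PlatformStatus != nil {
		platformType = infra.Status.PlatformStatus.Type
	}

	switch platformType {
	case configv1.AWSPlatformType, configv1.AzurePlatformType:
		// supported: run the spec
	default:
		g.Skip(fmt.Sprintf("platform %q does not support CPMS-managed scaling", platformType))
	}
}
```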
Ah okay, I get the context for this now. I'm still not liking the former layout of skipping one of the two It() blocks, and I haven't reviewed https://github.com/openshift/origin/pull/27461/files enough to see whether we can make that an entirely separate test case.

The quorum checker test would even need to disable the CPMS as well (when active), so we would have to serialize or order it so that it can clean up and not be disruptive.

In the interest of landing a smaller fix temporarily (to improve pass rates and unblock other debugging efforts), I would lean towards punting the refactor until we actually have a second test case. Not to say it's your problem :) I will refactor the test to make it more Ginkgo-readable as a follow-up.

This would also be easier to backport to 4.12 in case we have to refactor again in ETCD-330. WDYT?
> and haven't reviewed https://github.com/openshift/origin/pull/27461/files enough to see whether we can make that an entirely separate test case.

Save yourself the time, I'll likely write this again from scratch.

> The quorum checker test would even need to disable the CPMS as well (when active), so we would have to serialize or order it so that it can clean up and not be disruptive.

I personally would just have the checker test run only when there is no CPMS.

> I would lean towards punting the refactor until we actually have a second test case. Not to say it's your problem :) I will refactor the test to make it more Ginkgo-readable as a follow-up.

All good, I'm fine with refactoring this in the other ticket. Let's get the pass rates fixed first before we complicate things.
```go
}

// No need to be deterministic since it doesn't matter which machine we delete
machineToDelete = machineList.Items[0].Name
```
There might be two invocations of the wait.Poll, especially if the delete failed on some transient client/response error, so we could end up with two deleted masters here.

I would just remove the wait.Poll altogether and rely on the machineClient retries on a single chosen machine name.
Good catch. I'm going to make it deterministic by always picking the master name with the lowest index, since all machines are suffixed, e.g. ...-master-0, ...-master-1, etc.

> remove the wait.Poll altogether and rely on the machineClient retries

The wait.Poll is for retrying, so I'm not sure if you meant replacing it with another util? The machineClient won't auto-retry. Or do you mean just looping the Delete until it goes through? That would need some backoff in case of transient API errors, which is what the wait.Poll provides.
Ahh, thank you. I thought there was some retry utility baked into the clients already. As long as you pick the master deterministically it's fine :)
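A minimal sketch of the deterministic pick (the function name is illustrative; the lexicographic sort relies on the ...-master-N suffix convention):

```go
package scalingtestinglibrary

import "sort"

// pickMachineToDelete returns the master machine name with the lowest
// suffix (e.g. ...-master-0 sorts before ...-master-1), so that every
// retry of the surrounding wait.Poll targets the same machine.
func pickMachineToDelete(names []string) string {
	sorted := append([]string(nil), names...) // copy to avoid mutating the caller's slice
	sort.Strings(sorted)
	return sorted[0]
}
```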
```go
t.Logf("attempting to delete machine %q", machineToDelete)

if err := machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{}); err != nil {
	return isTransientAPIError(t, err)
```
Also regarding the polling: if the machine is already deleted this will error out, so I think we need another

```go
if apierrors.IsNotFound(err) {
```

check as well.
For platforms where the ControlPlaneMachineSet is active and being reconciled by the CPMSO, the vertical scaling test should rely on the CPMSO to remove and add new machines; otherwise there is a race between the test removing a machine and the CPMSO adding a new one.
Force-pushed: 222ce3d → 144c8aa
```go
// The machine we just listed should be present but if not, error out
if apierrors.IsNotFound(err) {
	t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
	return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
```
Shouldn't this be

```go
return true, nil
```

after all, the expectation of the whole function is that you want to delete the machine? 🗡️
Yes, we want to delete the machine, but having just listed the machine in L93 and then subsequently finding out that it is no longer present means something has gone wrong.

Deleting the machine doesn't mean it should just go away immediately. You need the replacement machine to be created, added, and promoted as a member, and then member removal, deletion-hook removal, etc. before the machine should go away.

If at this point deleting the machine results in an IsNotFound error, then the test really can't verify anything in the above sequence of the vertical scaling workflow, since the machine is already gone.
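Putting the pieces together, a condensed sketch of how the delete-poll could look (not the exact test code; machineToDelete, machineClient, t, and isTransientAPIError are the identifiers from the snippets above, and the intervals are illustrative). An IsNotFound after a successful list is surfaced as a hard error, while transient errors keep the poll going:

```go
err = wait.PollWithContext(ctx, 5*time.Second, 5*time.Minute, func(ctx context.Context) (bool, error) {
	if err := machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{}); err != nil {
		if apierrors.IsNotFound(err) {
			// The machine was listed moments ago; it disappearing outside the
			// scaling workflow means the test can no longer verify anything.
			return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
		}
		// Retry on transient API errors, fail on everything else.
		return isTransientAPIError(t, err)
	}
	t.Logf("successfully deleted machine %q", machineToDelete)
	return true, nil
})
```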
/retest-required

The test is fine on vSphere (just the known disruption test failures again). Azure has passed with the new workflow before, and this looks to be an infra problem.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, hasbro17

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files: …

Approvers can indicate their approval by writing `/approve` in a comment.
Scaling jobs look good. vSphere has passed before, and Azure has the test passing but is seeing other failures:

/retest-required
/retest
@hasbro17: The following tests failed, say `/retest` to rerun all failed tests: …

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Needs a backport to 4.12 as well.

/cherry-pick release-4.12

Although I forgot to classify this as a 4.13 bug, so now I need the 4.13 bug verified and a 4.12 bug to depend on it.
@hasbro17: new pull request created: #27692

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Resolves: https://issues.redhat.com/browse/ETCD-329
The vertical scaling test workflow needs to account for the presence of the ControlPlaneMachineSet (CPMS). When the CPMS is present, the test should not manually add new machines, since they would be deleted by the CPMSO, which reconciles the control plane to the desired number of machines in spec.replicas.

This PR adds a new workflow that relies on the CPMSO to automatically scale up and scale down when the platform supports CPMS (or when its presence is detected).
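To summarize the shape of the change, here is a condensed sketch of the branching (EnsureReadyReplicasOnCPMS appears in the quoted diff; IsCPMSActive and the surrounding glue are illustrative assumptions):

```go
// Pre-test: detect whether the ControlPlaneMachineSet is present and active.
cpmsSupported, err := scalingtestinglibrary.IsCPMSActive(ctx, cpmsClient) // hypothetical helper
o.Expect(err).ToNot(o.HaveOccurred())

if cpmsSupported {
	// CPMS-based workflow: delete one master and let the CPMSO surge to a
	// fourth machine before it scales the control plane back down to three.
	err = machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{})
	o.Expect(err).ToNot(o.HaveOccurred())

	err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 4, cpmsClient)
	o.Expect(err).ToNot(o.HaveOccurred())
} else {
	// Manual workflow for unsupported platforms (GCP, vSphere):
	// create the replacement machine first, then delete the old one.
}
```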