Add test for scaling machineSets #22564
openshift-merge-robot merged 1 commit into openshift:master from
Conversation
/retest
```go
initialReplicasMachineSet := getMachineSetReplicaNumber(machineSet)
g.By(fmt.Sprintf("scaling %q from %d to %d replicas", machineName(machineSet), initialReplicasMachineSet, expectedScaleOut))
o.Expect(err).NotTo(o.HaveOccurred())
err = scaleMachineSet(machineName(machineSet), expectedScaleOut)
```
Can we avoid the `for` loop, so we scale both sets out immediately? Otherwise the second one only scales up once the first one has finished all its validations.
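A minimal sketch of the suggested two-phase shape: fire all the scale requests first, then validate. The helpers are illustrative stubs standing in for the PR's real ones:

```go
package main

import "fmt"

// scaleMachineSet is an illustrative stub for the helper in the diff;
// the real one updates the machineSet's scale subresource.
func scaleMachineSet(name string, replicas int) error {
	fmt.Printf("scaling %s to %d replicas\n", name, replicas)
	return nil
}

// validateScale is an illustrative stub for the per-machineSet checks.
func validateScale(name string, replicas int) error { return nil }

func main() {
	machineSets := []string{"machineSet0", "machineSet1"}
	const expectedScaleOut = 3

	// First pass: issue every scale request up front so the sets grow in parallel.
	for _, ms := range machineSets {
		if err := scaleMachineSet(ms, expectedScaleOut); err != nil {
			panic(err)
		}
	}
	// Second pass: only now block on validation for each set.
	for _, ms := range machineSets {
		if err := validateScale(ms, expectedScaleOut); err != nil {
			panic(err)
		}
	}
}
```

This way the second machineSet's scale-out overlaps with the first one's validation window instead of waiting for it.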
/test e2e-aws
```go
return nil, fmt.Errorf("error getting config: %v", err)
}

discoveryClient := discovery.NewDiscoveryClientForConfigOrDie(cfg)
```
It seems odd that most of the function handles and returns errors, but this can die immediately. Why not return the error here too?
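A sketch of the suggested change. client-go's `discovery.NewDiscoveryClientForConfig` is the error-returning variant of the `OrDie` constructor; it is modeled here with a stub so the shape stays self-contained:

```go
package main

import (
	"errors"
	"fmt"
)

// newDiscoveryClientForConfig stands in for client-go's
// discovery.NewDiscoveryClientForConfig, which returns an error
// instead of panicking like the OrDie variant.
func newDiscoveryClientForConfig(cfg *struct{}) (string, error) {
	if cfg == nil {
		return "", errors.New("nil config")
	}
	return "discovery-client", nil
}

func main() {
	cfg := &struct{}{}
	// Propagate the error like the rest of the function does,
	// rather than dying at this one call site.
	dc, err := newDiscoveryClientForConfig(cfg)
	if err != nil {
		fmt.Printf("error creating discovery client: %v\n", err)
		return // in the real function: return nil, err
	}
	fmt.Println("got", dc)
}
```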
```go
scaleUpdate.Spec.Replicas = int32(replicas)
_, err = scaleClient.Scales(machineAPINamespace).Update(schema.GroupResource{Group: machineAPIGroup, Resource: "MachineSet"}, scaleUpdate)
if err != nil {
	return fmt.Errorf("error calling scaleClient.Scales update: %v", err)
```
Also annotate the error with the replica count. It may help in future debug sessions to see why scaling to N replicas fails.
You should set "Serial" instead of "Disruptive" on this while testing it, so you can see the e2e-aws-serial suite run it. If total runtime is not terribly bad we can keep it there, or create a new suite for it (in the long run we'll do this; for now it's better to be testing).
813758e to dde9a9c (force-push)
```go
return nil, err
}
machineSets := objx.Map(obj.UnstructuredContent())
items := objects(machineSets.Get("items"))
```
Just `return objects(machineSets.Get("items")), nil` - the temporary doesn't do anything.
```go
o.Expect(err).NotTo(o.HaveOccurred())

// expect new nodes to come up for machineSet0
o.Eventually(func() bool {
```
Can we share the body of this func (as a literal) for both machineSet0 and machineSet1 - it looks identical.
```go
	}
}
return len(nodes) == expectedScaleOut
}, scalingTime, 5*time.Second).Should(o.BeTrue())
```
This assertion should be done at the call site.
```go
err = scaleMachineSet(machineName(machineSet1), expectedScaleOut)
o.Expect(err).NotTo(o.HaveOccurred())

verifyNodeScalingFunc(c, dc, expectedScaleOut, machineSet0)
```
This should assert true/false based on the bool result of verifyNodeScalingFunc.
/lgtm
/approve
ping @derekwaynecarr @mfojtik for approval
/test all
/test e2e-aws-serial
/test e2e-aws-upgrade
```go
var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
	g.It("grow and decrease when scaling different machineSets simultaneously", func() {
		// expect new nodes to come up for machineSet
```
You need a skip for platforms which don't support scaling.
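A hedged sketch of the kind of guard meant here; the platform names and the skip mechanism are illustrative, not the framework's actual API:

```go
package main

import "fmt"

// platformSupportsMachineSets reports whether the platform ships
// machine API machineSets; the list here is illustrative only.
func platformSupportsMachineSets(platform string) bool {
	switch platform {
	case "aws", "gcp", "azure":
		return true
	default:
		return false
	}
}

func main() {
	platform := "baremetal" // would come from the cluster's infrastructure config
	if !platformSupportsMachineSets(platform) {
		// In the real test this would call the e2e framework's Skip helper.
		fmt.Printf("SKIP: platform %q does not support machineSet scaling\n", platform)
		return
	}
	fmt.Println("running scaling test")
}
```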
e2e-aws-serial passed after 2h26m36s, and this is the longest-running job.
```go
// fetch nodes
allWorkerNodes, err := c.CoreV1().Nodes().List(metav1.ListOptions{
	LabelSelector: nodeLabelSelectorWorker,
```
It is not required that a cluster have worker-labeled nodes. Will this e2e test basically only run if you have worker nodes? Or will it fail if you don't?
Today we have machineSets only for the worker nodes in a newly created cluster by default. That is the assumption here, though omitting the worker label selector in the listing would also be fine. Yes, the test will fail if there are no worker nodes in the cluster.
I think you should skip if there is no worker machine set with a clear message, rather than fail.
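A sketch of the suggested guard, with the listing and the skip reduced to stubs:

```go
package main

import "fmt"

// listWorkerMachineSets is an illustrative stub; the real code lists
// machineSets via the dynamic client.
func listWorkerMachineSets() []string { return nil }

func main() {
	machineSets := listWorkerMachineSets()
	if len(machineSets) == 0 {
		// Skip with a clear message instead of failing on clusters
		// (e.g. 3-master, no-worker) that have no worker machineSets.
		fmt.Println("SKIP: no worker machineSets found in the cluster")
		return
	}
	fmt.Printf("found %d machineSets\n", len(machineSets))
}
```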
```go
nodeList, err := c.CoreV1().Nodes().List(metav1.ListOptions{
	LabelSelector: nodeLabelSelectorWorker,
})
o.Expect(err).NotTo(o.HaveOccurred())
```
If this is empty, I would expect this test to be skipped. Also, why aren't you passing nodeLabelSelector to getNodesFromMachineSet?
Is this ok if there is no worker node in a newly created cluster? Is there a job which creates such a non-worker cluster?
There will be. We will be adding jobs that create 3 master clusters that run the e2e tests. In that scenario this test should be skipped (probably), or when we add that job we can change the logic here.
ping @smarterclayton
Adding 20 minutes to serial runs is a lot. What can you do to reduce the time this test takes to 8-10 minutes?
Not sure if this test is really adding an additional 20 mins overall. I checked the logs, and this test seems to be taking ~3-4 mins. I also ran it locally and verified the time taken; the local run took ~4 mins too. Let's run a couple more times to see if it is really adding ~20 mins.
This time e2e-aws-serial passed after 2h12m24s.
/test e2e-aws
/retest
/test e2e-aws

@smarterclayton the test is taking just ~4 min.
/lgtm

Great test
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre, frobware, smarterclayton, vikaschoudhary16 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Using the 'upgrade-all' precedent from cfcd60f (release: Standardize all ci-chat-bot jobs, 2020-04-27, openshift#8594). I'm not clear on why we are joining with a newline instead of '&&'; presumably this is getting wrapped in a 'set -e' or equivalent. But I'm sticking with newline to match precedent.

This increases the risk that we time out these slow jobs (e.g. [1] took 3h42m), but we really want to exercise tests like openshift/origin@9f7fe0089d (Add test for scaling machineSets, 2019-04-11, openshift/origin#22564), which is in openshift/conformance/serial, because machines launch with the born-in boot images until we get [2].

And in fact, the reason why we didn't have this post-update suite in 4.6 was because of 3bc9d8e (stop running e2e tests after three upgrades because we hit timeouts and lose upgrade signal, 2020-10-05, openshift#12436). But since 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), we no longer have to worry about getting logs when that step is slow. So we might not pass if we're slow, but we'll still get logs to debug why we're slow.

Only for 4.6 and later, because 4.5 is live and if we had problems there we'd probably have already heard about them from customers.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/1318709056830967808
[2]: openshift/enhancements#201
Continuation of Alberto's PR #22544
Scaling machines/nodes is a feature we support. From modifying a replica number to having a new running node, there are multiple components involved: machine API, networking, container runtime, MCO, etc. This test is a gate to prevent any component from breaking this feature, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1698253, https://bugzilla.redhat.com/show_bug.cgi?id=1698624
- List machineSets
- Scale current replicas of each machineSet to 3
- Verify new nodes are created and go Ready for each machineSet
- Scale down to the original replica number
- Verify the final number of worker nodes in the cluster matches the original
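The steps above can be sketched end to end (all helpers are illustrative stubs; replica numbers are made up):

```go
package main

import "fmt"

// Illustrative stubs for the cluster helpers the e2e test uses.
var replicas = map[string]int{"worker-a": 1, "worker-b": 2}

func listMachineSets() []string                { return []string{"worker-a", "worker-b"} }
func scaleMachineSet(name string, n int) error { replicas[name] = n; return nil }
func workerNodeCount() int {
	total := 0
	for _, n := range replicas {
		total += n
	}
	return total
}

func main() {
	const scaleOut = 3

	// 1. list machineSets, remembering original replica counts
	original := map[string]int{}
	for _, ms := range listMachineSets() {
		original[ms] = replicas[ms]
	}
	before := workerNodeCount()

	// 2. scale current replicas of each machineSet to 3
	for _, ms := range listMachineSets() {
		scaleMachineSet(ms, scaleOut)
	}
	// 3. verify new nodes are created and go Ready (polling elided here)

	// 4. scale down to the original replica number
	for ms, n := range original {
		scaleMachineSet(ms, n)
	}

	// 5. final number of worker nodes should match the original
	fmt.Println("match:", workerNodeCount() == before)
	// prints: match: true
}
```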