
Add test for scaling machineSets #22564

Merged
openshift-merge-robot merged 1 commit into openshift:master from vikaschoudhary16:scale
Jul 31, 2019

Conversation

@vikaschoudhary16
Contributor

@vikaschoudhary16 vikaschoudhary16 commented Apr 12, 2019

Continuation of Alberto's PR #22544

Scaling machines/nodes is a feature we support. From modifying a replica number to having a new running node, there are multiple components involved: machine API, networking, container runtime, MCO, etc. This test is a gate to prevent any component from breaking this feature, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1698253, https://bugzilla.redhat.com/show_bug.cgi?id=1698624

- List machineSets
- Scale current replicas of each machineSet to 3
- Verify new nodes are created and go Ready for each machineSet
- Scale down to the original replica number
- Verify the final number of worker nodes in the cluster matches the original

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 12, 2019
@enxebre
Member

enxebre commented Apr 15, 2019

/retest

initialReplicasMachineSet := getMachineSetReplicaNumber(machineSet)
g.By(fmt.Sprintf("scaling %q from %d to %d replicas", machineName(machineSet), initialReplicasMachineSet, expectedScaleOut))
o.Expect(err).NotTo(o.HaveOccurred())
err = scaleMachineSet(machineName(machineSet), expectedScaleOut)
Member


Can we avoid the for loop, so that we scale both sets out immediately? Otherwise the second one only scales up after the first one has finished all its validations.

Contributor Author


@enxebre done!

@vikaschoudhary16
Contributor Author

/test e2e-aws

Contributor

@frobware frobware left a comment


To be deterministic, the scaling bounds on all the machine sets need to be min:1, max:2, as the default behaviour of the CA in 4.1 is to randomly place nodes in a node group (read: machines in a machine set). Never mind. On autopilot.

Comment thread test/extended/machines/scale.go Outdated
return nil, fmt.Errorf("error getting config: %v", err)
}

discoveryClient := discovery.NewDiscoveryClientForConfigOrDie(cfg)
Contributor


It seems odd that most of the function handles and returns errors, but this call can die immediately. Why not return an error here too?
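The point being made: inside a function that already propagates errors, prefer the error-returning constructor over the `OrDie` variant (client-go does ship `discovery.NewDiscoveryClientForConfig` alongside the `OrDie` form). A self-contained sketch of the two shapes with invented stand-in types:

```go
package main

import (
	"errors"
	"fmt"
)

type config struct{ valid bool }
type client struct{}

// newClient is the error-returning form; the caller decides what to do.
func newClient(cfg config) (*client, error) {
	if !cfg.valid {
		return nil, errors.New("invalid config")
	}
	return &client{}, nil
}

// buildClients shows the suggested shape: wrap and propagate the error
// instead of panicking mid-function.
func buildClients(cfg config) (*client, error) {
	c, err := newClient(cfg)
	if err != nil {
		return nil, fmt.Errorf("error getting client: %v", err)
	}
	return c, nil
}

func main() {
	if _, err := buildClients(config{valid: false}); err != nil {
		fmt.Println(err)
	}
}
```

This keeps all failure paths of the function flowing through the same `return nil, fmt.Errorf(...)` style the surrounding code already uses.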

Comment thread test/extended/machines/scale.go Outdated
scaleUpdate.Spec.Replicas = int32(replicas)
_, err = scaleClient.Scales(machineAPINamespace).Update(schema.GroupResource{Group: machineAPIGroup, Resource: "MachineSet"}, scaleUpdate)
if err != nil {
return fmt.Errorf("error calling scaleClient.Scales update: %v", err)
Contributor


Also annotate the error with the replica count. It may help in future debug sessions to see why scaling to $N replicas fails.

@smarterclayton
Contributor

You should set “Serial” instead of Disruptive on this while testing it so you can see the e2e-aws-serial suite run it. If the total runtime is not terribly bad we can keep it in there or create a new suite for it (in the long run we’ll do this; just for now it’s better to be testing).

@vikaschoudhary16 vikaschoudhary16 force-pushed the scale branch 3 times, most recently from 813758e to dde9a9c Compare April 23, 2019 09:07
Comment thread test/extended/machines/scale.go Outdated
return nil, err
}
machineSets := objx.Map(obj.UnstructuredContent())
items := objects(machineSets.Get("items"))
Contributor


Just return objects(machineSets.Get("items")), nil directly; the temporary doesn't do anything.

Comment thread test/extended/machines/scale.go Outdated
o.Expect(err).NotTo(o.HaveOccurred())

// expect new nodes to come up for machineSet0
o.Eventually(func() bool {
Contributor


Can we share the body of this func (as a literal) for both machineSet0 and machineSet1? It looks identical.
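The suggested deduplication: build the verification predicate once as a closure over the expected count, then call it for each machineSet. All names here are illustrative stand-ins for the test's node-listing logic:

```go
package main

import "fmt"

// nodesGrewFunc builds one predicate shared by every machineSet check:
// true once the set's node count matches the expected scale-out target.
func nodesGrewFunc(nodesByMachineSet map[string][]string, want int) func(string) bool {
	return func(machineSet string) bool {
		return len(nodesByMachineSet[machineSet]) == want
	}
}

func main() {
	nodes := map[string][]string{
		"machineSet0": {"n1", "n2", "n3"},
		"machineSet1": {"n4", "n5"},
	}
	nodesGrew := nodesGrewFunc(nodes, 3)
	// One literal, shared by both machineSets; only the set name varies.
	for _, ms := range []string{"machineSet0", "machineSet1"} {
		fmt.Println(ms, nodesGrew(ms))
	}
}
```

In the actual test the shared literal would be handed to `o.Eventually` once per machineSet instead of duplicating the body inline.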

Comment thread test/extended/machines/scale.go Outdated
}
}
return len(nodes) == expectedScaleOut
}, scalingTime, 5*time.Second).Should(o.BeTrue())
Contributor


This assertion should be done at the call site.

Comment thread test/extended/machines/scale.go Outdated
err = scaleMachineSet(machineName(machineSet1), expectedScaleOut)
o.Expect(err).NotTo(o.HaveOccurred())

verifyNodeScalingFunc(c, dc, expectedScaleOut, machineSet0)
Contributor


This should assert true/false based on the bool result of verifyNodeScalingFunc.
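The shape both of these review comments are asking for: the helper reports a bool and the call site owns the assertion. A sketch with invented names; in the real test the call site would wrap this in `o.Expect(...).To(o.BeTrue())`:

```go
package main

import "fmt"

// verifyNodeScaling reports whether the machineSet reached the expected
// node count; it performs no assertion itself, leaving that to the caller.
func verifyNodeScaling(nodesPerSet map[string]int, machineSet string, want int) bool {
	return nodesPerSet[machineSet] == want
}

func main() {
	nodes := map[string]int{"machineSet0": 3, "machineSet1": 2}
	for _, ms := range []string{"machineSet0", "machineSet1"} {
		// Caller decides what a false result means (fail, retry, log).
		fmt.Println(ms, verifyNodeScaling(nodes, ms, 3))
	}
}
```

Keeping the assertion at the call site also gives each failure a distinct line in the test output, which makes it obvious which machineSet lagged.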

Comment thread test/extended/machines/scale.go Outdated
@frobware
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 25, 2019
@enxebre
Member

enxebre commented Apr 30, 2019

/approve

@vikaschoudhary16
Contributor Author

ping @derekwaynecarr @mfojtik for approval

@smarterclayton
Contributor

/test all

@vikaschoudhary16
Contributor Author

/test e2e-aws-serial

@vikaschoudhary16
Contributor Author

/test e2e-aws-upgrade


var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
g.It("grow and decrease when scaling different machineSets simultaneously", func() {
// expect new nodes to come up for machineSet
Contributor


You need a skip for platforms which don’t support scaling.

@vikaschoudhary16
Contributor Author

> How long does the test take now?

e2e-aws-serial passed after 2h26m36s and this is the longest running job.


// fetch nodes
allWorkerNodes, err := c.CoreV1().Nodes().List(metav1.ListOptions{
LabelSelector: nodeLabelSelectorWorker,
Contributor

@smarterclayton smarterclayton Jul 3, 2019


It is not required that a cluster have worker selector nodes. Will this e2e test basically only run if you have worker nodes? Or will it fail if you don't?

Contributor Author


Today we have machineSets only for the worker nodes in a newly created cluster by default; that is the assumption here. Though omitting the worker label selector from the listing would also be fine. Yes, the test will fail if there are no worker nodes in the cluster.

Contributor


I think you should skip if there is no worker machine set with a clear message, rather than fail.
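One possible shape for skip-instead-of-fail, using a stdlib stand-in for the test framework's skip mechanism (the real test would call ginkgo's skip directly; `checkWorkers` and the skip type are invented for this sketch):

```go
package main

import "fmt"

// skipError signals that the cluster has no worker machineSets and the
// test should be skipped with a clear message rather than failed.
type skipError struct{ reason string }

func (e skipError) Error() string { return e.reason }

// checkWorkers returns a skipError when there is nothing to scale, and
// nil when the test can proceed.
func checkWorkers(workerMachineSets []string) error {
	if len(workerMachineSets) == 0 {
		return skipError{reason: "no worker machineSets found; skipping scaling test"}
	}
	return nil
}

func main() {
	if err := checkWorkers(nil); err != nil {
		fmt.Println("SKIP:", err)
	}
}
```

On a 3-master, no-worker cluster this path turns a hard failure into an explicit, self-explaining skip.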

nodeList, err := c.CoreV1().Nodes().List(metav1.ListOptions{
LabelSelector: nodeLabelSelectorWorker,
})
o.Expect(err).NotTo(o.HaveOccurred())
Contributor


If this is empty, I would expect this test to be skipped. Also, why aren't you passing nodeLabelSelector to getNodesFromMachineSet?

Contributor Author

@vikaschoudhary16 vikaschoudhary16 Jul 3, 2019


Is it ok for there to be no worker nodes in a newly created cluster? Is there a job which creates such a no-worker cluster?

Contributor


There will be. We will be adding jobs that create 3 master clusters that run the e2e tests. In that scenario this test should be skipped (probably), or when we add that job we can change the logic here.

@vikaschoudhary16
Contributor Author

ping @smarterclayton

@smarterclayton
Contributor

Adding 20 minutes to serial runs is a lot. What can you do to reduce the time this test takes to 8-10 minutes?

@vikaschoudhary16
Contributor Author

> Adding 20 minutes to serial runs is a lot. What can you do to reduce the time this test takes to 8-10 minutes?

Not sure this test is really adding an additional 20 mins overall. I checked the logs and this test seems to be taking ~3-4 mins:

started: (0/91/212) "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

passed: (3m30s) 2019-07-03T10:21:38 "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

Also, I ran it locally and verified the time taken by this test. The local run also took ~4 mins.

Let's run a couple more times to see if it is really adding ~20 mins.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 8, 2019
@vikaschoudhary16
Contributor Author

This time e2e-aws-serial passed after 2h12m24s.

@vikaschoudhary16
Contributor Author

/test e2e-aws

1 similar comment
@vikaschoudhary16
Contributor Author

/test e2e-aws

@vikaschoudhary16
Contributor Author

/retest

1 similar comment
@vikaschoudhary16
Contributor Author

/retest

@vikaschoudhary16
Contributor Author

/test e2e-aws

@vikaschoudhary16
Contributor Author

started: (0/175/218) "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

passed: (4m12s) 2019-07-31T06:53:03 "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

@smarterclayton the test is taking just ~4 min.

@smarterclayton
Contributor

/lgtm

Great test

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 31, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, frobware, smarterclayton, vikaschoudhary16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2019
@openshift-merge-robot openshift-merge-robot merged commit 5ece8fa into openshift:master Jul 31, 2019
wking added a commit to wking/openshift-release that referenced this pull request Oct 22, 2020
Using the 'upgrade-all' precedent from cfcd60f (release:
Standardize all ci-chat-bot jobs, 2020-04-27, openshift#8594).  I'm not clear
on why we are joining with a newline instead of '&&'; presumably this
is getting wrapped in a 'set -e' or equivalent.  But I'm sticking with
newline to match precedent.

This increases the risk that we time out these slow jobs (e.g. [1]
took 3h42m), but we really want to exercise tests like
openshift/origin@9f7fe0089d (Add test for scaling machineSets,
2019-04-11, openshift/origin#22564), which is in
openshift/conformance/serial, because machines launch with the born-in
boot images until we get [2].

And in fact, the reason why we didn't have this post-update suite in
4.6 was because of 3bc9d8e (stop running e2e tests after three
upgrades because we hit timeouts and lose upgrade signal, 2020-10-05, openshift#12436).
But since 3c915e2 (ci-operator/step-registry/openshift/e2e/test:
Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), we no longer have
to worry about getting logs when that step is slow.  So we might not
pass if we're slow, but we'll still get logs to debug why we're slow.

Only for 4.6 and later, because 4.5 is live and if we had problems
there we'd probably have already heard about them from customers.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/1318709056830967808
[2]: openshift/enhancements#201