
Add test for scaling machineSets #22564

Merged
openshift-merge-robot merged 1 commit into openshift:master from vikaschoudhary16:scale
Jul 31, 2019

Conversation

@vikaschoudhary16
Contributor

@vikaschoudhary16 vikaschoudhary16 commented Apr 12, 2019

Continuation of Alberto's PR #22544

Scaling machines/nodes is a feature we support. From modifying a replica number to having a new running node, there are multiple components involved: machine API, networking, container runtime, MCO, etc. This test is a gate to prevent any component from breaking this feature, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1698253, https://bugzilla.redhat.com/show_bug.cgi?id=1698624

- List machineSets
- Scale current replicas of each machineSet to 3
- Verify new nodes are created and go Ready for each machineSet
- Scale down to the original replica number
- Verify the final number of worker nodes in the cluster matches the original

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 12, 2019
@enxebre
Member

enxebre commented Apr 15, 2019

/retest

initialReplicasMachineSet := getMachineSetReplicaNumber(machineSet)
g.By(fmt.Sprintf("scaling %q from %d to %d replicas", machineName(machineSet), initialReplicasMachineSet, expectedScaleOut))
o.Expect(err).NotTo(o.HaveOccurred())
err = scaleMachineSet(machineName(machineSet), expectedScaleOut)
Member


Can we avoid the for loop, so that we scale both sets out immediately? Otherwise the second one only scales up after the first one has finished all its validations.

Contributor Author


@enxebre done!

@vikaschoudhary16
Contributor Author

/test e2e-aws

Contributor

@frobware frobware left a comment


To be deterministic, the scaling bounds on all the machine sets need to be min:1, max:2, as the default behaviour of the CA in 4.1 is to randomly place nodes in a node group (read: machines in a machine set). Never mind. On autopilot.

Comment thread test/extended/machines/scale.go Outdated
return nil, fmt.Errorf("error getting config: %v", err)
}

discoveryClient := discovery.NewDiscoveryClientForConfigOrDie(cfg)
Contributor


It seems odd that most of the function handles and returns errors, but this call can die immediately. Why not return an error here too?
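The point being made: inside a function that already propagates errors, prefer the error-returning constructor over the `OrDie` variant (client-go does ship `discovery.NewDiscoveryClientForConfig` alongside the `OrDie` form). A self-contained sketch of the two shapes with invented stand-in types:

```go
package main

import (
	"errors"
	"fmt"
)

type config struct{ valid bool }
type client struct{}

// newClient is the error-returning form; the caller decides what to do.
func newClient(cfg config) (*client, error) {
	if !cfg.valid {
		return nil, errors.New("invalid config")
	}
	return &client{}, nil
}

// buildClients shows the suggested shape: wrap and propagate the error
// instead of panicking mid-function.
func buildClients(cfg config) (*client, error) {
	c, err := newClient(cfg)
	if err != nil {
		return nil, fmt.Errorf("error getting client: %v", err)
	}
	return c, nil
}

func main() {
	if _, err := buildClients(config{valid: false}); err != nil {
		fmt.Println(err)
	}
}
```

This keeps all failure paths of the function flowing through the same `return nil, fmt.Errorf(...)` style the surrounding code already uses.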

Comment thread test/extended/machines/scale.go Outdated
scaleUpdate.Spec.Replicas = int32(replicas)
_, err = scaleClient.Scales(machineAPINamespace).Update(schema.GroupResource{Group: machineAPIGroup, Resource: "MachineSet"}, scaleUpdate)
if err != nil {
return fmt.Errorf("error calling scaleClient.Scales update: %v", err)
Contributor


Also annotate the error with the replica count. It may help in future debug sessions to see why scaling to $N replicas fails.

@smarterclayton
Contributor

You should set “Serial” instead of Disruptive on this while testing it so you can see the e2e-aws-serial suite run it. If the total runtime is not terribly bad we can keep it in there or create a new suite for it (in the long run we’ll do this; just for now it’s better to be testing).

@vikaschoudhary16 vikaschoudhary16 force-pushed the scale branch 3 times, most recently from 813758e to dde9a9c Compare April 23, 2019 09:07
Comment thread test/extended/machines/scale.go Outdated
return nil, err
}
machineSets := objx.Map(obj.UnstructuredContent())
items := objects(machineSets.Get("items"))
Contributor


Just return objects(machineSets.Get("items")), nil directly; the temporary doesn't do anything.

Comment thread test/extended/machines/scale.go Outdated
o.Expect(err).NotTo(o.HaveOccurred())

// expect new nodes to come up for machineSet0
o.Eventually(func() bool {
Contributor


Can we share the body of this func (as a literal) for both machineSet0 and machineSet1? It looks identical.
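The suggested deduplication: build the verification predicate once as a closure over the expected count, then call it for each machineSet. All names here are illustrative stand-ins for the test's node-listing logic:

```go
package main

import "fmt"

// nodesGrewFunc builds one predicate shared by every machineSet check:
// true once the set's node count matches the expected scale-out target.
func nodesGrewFunc(nodesByMachineSet map[string][]string, want int) func(string) bool {
	return func(machineSet string) bool {
		return len(nodesByMachineSet[machineSet]) == want
	}
}

func main() {
	nodes := map[string][]string{
		"machineSet0": {"n1", "n2", "n3"},
		"machineSet1": {"n4", "n5"},
	}
	nodesGrew := nodesGrewFunc(nodes, 3)
	// One literal, shared by both machineSets; only the set name varies.
	for _, ms := range []string{"machineSet0", "machineSet1"} {
		fmt.Println(ms, nodesGrew(ms))
	}
}
```

In the actual test the shared literal would be handed to `o.Eventually` once per machineSet instead of duplicating the body inline.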

Comment thread test/extended/machines/scale.go Outdated
}
}
return len(nodes) == expectedScaleOut
}, scalingTime, 5*time.Second).Should(o.BeTrue())
Contributor


This assertion should be done at the call site.

Comment thread test/extended/machines/scale.go Outdated
err = scaleMachineSet(machineName(machineSet1), expectedScaleOut)
o.Expect(err).NotTo(o.HaveOccurred())

verifyNodeScalingFunc(c, dc, expectedScaleOut, machineSet0)
Contributor


This should assert true/false based on the bool result of verifyNodeScalingFunc.
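The shape both of these review comments are asking for: the helper reports a bool and the call site owns the assertion. A sketch with invented names; in the real test the call site would wrap this in `o.Expect(...).To(o.BeTrue())`:

```go
package main

import "fmt"

// verifyNodeScaling reports whether the machineSet reached the expected
// node count; it performs no assertion itself, leaving that to the caller.
func verifyNodeScaling(nodesPerSet map[string]int, machineSet string, want int) bool {
	return nodesPerSet[machineSet] == want
}

func main() {
	nodes := map[string]int{"machineSet0": 3, "machineSet1": 2}
	for _, ms := range []string{"machineSet0", "machineSet1"} {
		// Caller decides what a false result means (fail, retry, log).
		fmt.Println(ms, verifyNodeScaling(nodes, ms, 3))
	}
}
```

Keeping the assertion at the call site also gives each failure a distinct line in the test output, which makes it obvious which machineSet lagged.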

Comment thread test/extended/machines/scale.go Outdated
@frobware
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 25, 2019
@enxebre
Member

enxebre commented Apr 30, 2019

/approve

@vikaschoudhary16
Contributor Author

ping @derekwaynecarr @mfojtik for approval

@smarterclayton
Contributor

/test all

@vikaschoudhary16
Contributor Author

/test e2e-aws-serial

@vikaschoudhary16
Contributor Author

/test e2e-aws-upgrade


var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
g.It("grow and decrease when scaling different machineSets simultaneously", func() {
// expect new nodes to come up for machineSet
Contributor


You need a skip for platforms which don’t support scaling.

@vikaschoudhary16
Contributor Author

> How long does the test take now?

e2e-aws-serial passed after 2h26m36s and this is the longest running job.


// fetch nodes
allWorkerNodes, err := c.CoreV1().Nodes().List(metav1.ListOptions{
LabelSelector: nodeLabelSelectorWorker,
Contributor

@smarterclayton smarterclayton Jul 3, 2019


It is not required that a cluster have worker selector nodes. Will this e2e test basically only run if you have worker nodes? Or will it fail if you don't?

Contributor Author


Today we have machineSets only for the worker nodes in a newly created cluster by default; that is the assumption here. Though omitting the worker label selector from the listing would also be fine. Yes, the test will fail if there are no worker nodes in the cluster.

Contributor


I think you should skip if there is no worker machine set with a clear message, rather than fail.
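One possible shape for skip-instead-of-fail, using a stdlib stand-in for the test framework's skip mechanism (the real test would call ginkgo's skip directly; `checkWorkers` and the skip type are invented for this sketch):

```go
package main

import "fmt"

// skipError signals that the cluster has no worker machineSets and the
// test should be skipped with a clear message rather than failed.
type skipError struct{ reason string }

func (e skipError) Error() string { return e.reason }

// checkWorkers returns a skipError when there is nothing to scale, and
// nil when the test can proceed.
func checkWorkers(workerMachineSets []string) error {
	if len(workerMachineSets) == 0 {
		return skipError{reason: "no worker machineSets found; skipping scaling test"}
	}
	return nil
}

func main() {
	if err := checkWorkers(nil); err != nil {
		fmt.Println("SKIP:", err)
	}
}
```

On a 3-master, no-worker cluster this path turns a hard failure into an explicit, self-explaining skip.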

nodeList, err := c.CoreV1().Nodes().List(metav1.ListOptions{
LabelSelector: nodeLabelSelectorWorker,
})
o.Expect(err).NotTo(o.HaveOccurred())
Contributor


If this is empty, I would expect this test to be skipped. Also, why aren't you passing nodeLabelSelector to getNodesFromMachineSet?

Contributor Author

@vikaschoudhary16 vikaschoudhary16 Jul 3, 2019


Is it ok for there to be no worker nodes in a newly created cluster? Is there a job which creates such a no-worker cluster?

Contributor


There will be. We will be adding jobs that create 3 master clusters that run the e2e tests. In that scenario this test should be skipped (probably), or when we add that job we can change the logic here.

@vikaschoudhary16
Contributor Author

ping @smarterclayton

@smarterclayton
Contributor

Adding 20 minutes to serial runs is a lot. What can you do to reduce the time this test takes to 8-10 minutes?

@vikaschoudhary16
Contributor Author

> Adding 20 minutes to serial runs is a lot. What can you do to reduce the time this test takes to 8-10 minutes?

Not sure this test is really adding an additional 20 mins overall. I checked the logs and this test seems to be taking ~3-4 mins:

started: (0/91/212) "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

passed: (3m30s) 2019-07-03T10:21:38 "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

Also, I ran it locally and verified the time taken by this test. The local run also took ~4 mins.

Let's run a couple more times to see if it is really adding ~20 mins.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 8, 2019
@vikaschoudhary16
Contributor Author

This time e2e-aws-serial passed after 2h12m24s.

@vikaschoudhary16
Contributor Author

/test e2e-aws

1 similar comment
@vikaschoudhary16
Contributor Author

/test e2e-aws

@vikaschoudhary16
Contributor Author

/retest

1 similar comment
@vikaschoudhary16
Contributor Author

/retest

@vikaschoudhary16
Contributor Author

/test e2e-aws

@vikaschoudhary16
Contributor Author

started: (0/175/218) "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

passed: (4m12s) 2019-07-31T06:53:03 "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

@smarterclayton the test is taking just ~4 min.

@smarterclayton
Contributor

/lgtm

Great test

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 31, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, frobware, smarterclayton, vikaschoudhary16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2019
@openshift-merge-robot openshift-merge-robot merged commit 5ece8fa into openshift:master Jul 31, 2019
wking added a commit to wking/openshift-release that referenced this pull request Oct 22, 2020
Using the 'upgrade-all' precedent from cfcd60f (release:
Standardize all ci-chat-bot jobs, 2020-04-27, openshift#8594).  I'm not clear
on why we are joining with a newline instead of '&&'; presumably this
is getting wrapped in a 'set -e' or equivalent.  But I'm sticking with
newline to match precedent.

This increases the risk that we time out these slow jobs (e.g. [1]
took 3h42m), but we really want to exercise tests like
openshift/origin@9f7fe0089d (Add test for scaling machineSets,
2019-04-11, openshift/origin#22564), which is in
openshift/conformance/serial, because machines launch with the born-in
boot images until we get [2].

And in fact, the reason why we didn't have this post-update suite in
4.6 was because of 3bc9d8e (stop running e2e tests after three
upgrades because we hit timeouts and lose upgrade signal, 2020-10-05, openshift#12436).
But since 3c915e2 (ci-operator/step-registry/openshift/e2e/test:
Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), we no longer have
to worry about getting logs when that step is slow.  So we might not
pass if we're slow, but we'll still get logs to debug why we're slow.

Only for 4.6 and later, because 4.5 is live and if we had problems
there we'd probably have already heard about them from customers.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/1318709056830967808
[2]: openshift/enhancements#201