@hasbro17
Contributor

Resolves: https://issues.redhat.com/browse/ETCD-329

The vertical scaling test workflow needs to account for the presence of the ControlPlaneMachineSet (CPMS). When a CPMS is present, the test should not manually add new machines, because they would be deleted by the CPMSO, which reconciles the control plane to the desired number of machines in spec.replicas.

This PR adds a new workflow that relies on the CPMSO to automatically scale up and scale down when the platform supports CPMS (or when its presence is detected).

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 25, 2022
@openshift-ci openshift-ci bot requested review from csrwng and deads2k October 25, 2022 23:35
@openshift-ci openshift-ci bot added the vendor-update Touching vendor dir or related files label Oct 25, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2022
@openshift-merge-robot openshift-merge-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 28, 2022
@hasbro17 hasbro17 force-pushed the update-scaling-test-to-use-cpms branch from 8a609d9 to 89498f9 Compare December 1, 2022 07:04
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2022
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2022
@hasbro17
Contributor Author

hasbro17 commented Dec 1, 2022

/test e2e-aws-ovn-etcd-scaling

@openshift-ci
Contributor

openshift-ci bot commented Dec 1, 2022

@hasbro17: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-aws-jenkins
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-image-registry
  • /test e2e-aws-ovn-serial
  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-builds
  • /test e2e-gcp-ovn-image-ecosystem
  • /test e2e-gcp-ovn-upgrade
  • /test extended_gssapi
  • /test extended_ldap_groups
  • /test extended_networking
  • /test images
  • /test lint
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
  • /test e2e-agnostic-ovn-cmd
  • /test e2e-aws
  • /test e2e-aws-csi
  • /test e2e-aws-csi-migration
  • /test e2e-aws-disruptive
  • /test e2e-aws-multitenant
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-cgroupsv2
  • /test e2e-aws-ovn-single-node
  • /test e2e-aws-ovn-single-node-serial
  • /test e2e-aws-ovn-single-node-upgrade
  • /test e2e-aws-ovn-upgrade
  • /test e2e-aws-proxy
  • /test e2e-azure
  • /test e2e-gcp-csi
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-fips-serial
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-metal-ipi-sdn
  • /test e2e-metal-ipi-serial
  • /test e2e-metal-ipi-serial-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test e2e-openstack-kuryr
  • /test e2e-openstack-ovn
  • /test e2e-openstack-serial
  • /test e2e-vsphere
  • /test okd-e2e-gcp

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
  • pull-ci-openshift-origin-master-e2e-aws-ovn-fips
  • pull-ci-openshift-origin-master-e2e-aws-ovn-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
  • pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-csi
  • pull-ci-openshift-origin-master-e2e-gcp-ovn
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-origin-master-e2e-metal-ipi-sdn
  • pull-ci-openshift-origin-master-e2e-openstack-ovn
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-lint
  • pull-ci-openshift-origin-master-unit
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps

In response to this:

/test e2e-aws-ovn-etcd-scaling

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hasbro17
Contributor Author

hasbro17 commented Dec 7, 2022

/test e2e-aws-ovn-etcd-scaling

@hasbro17
Contributor Author

hasbro17 commented Dec 9, 2022

/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

@hasbro17
Contributor Author

/test e2e-gcp-ovn-etcd-scaling
/test e2e-azure-ovn-etcd-scaling

@hasbro17 hasbro17 force-pushed the update-scaling-test-to-use-cpms branch from 89498f9 to 1e20d7d Compare December 15, 2022 18:11
@hasbro17 hasbro17 changed the title WIP: Update etcd scaling test for CPMS supported platforms Update etcd scaling test for CPMS supported platforms Dec 15, 2022
@openshift-ci openshift-ci bot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 15, 2022
@hasbro17
Contributor Author

/hold
Until the scaling jobs pass on all platforms

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2022
@hasbro17 hasbro17 force-pushed the update-scaling-test-to-use-cpms branch from 1e20d7d to 0345bcd Compare December 15, 2022 18:31
Comment on lines 75 to 79
// step 2: wait until the CPMSO scales-up by creating a new machine
// We need to check the cpms' status.readyReplicas because the phase of one machine will always be Deleting
// so we can't use EnsureMasterMachinesAndCount() since that looks for non-Deleting machines
err = scalingtestinglibrary.EnsureReadyReplicasOnCPMS(ctx, g.GinkgoT(), 4, cpmsClient)
Contributor Author

Seems like we should watch for replicas instead of readyReplicas. The latter may never surge to 4 in this case as the deleted machine's member can get removed before we scale up.

From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27497/pull-ci-openshift-origin-master-e2e-aws-ovn-etcd-scaling/1603457840947662848

Member removed at 19:19:21

2022-12-15T19:19:21.221655410Z I1215 19:19:21.221269       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"16340e00-06fa-4526-9dc4-3da772936f35", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberRemove' removed member with ID: 2485787822400865737
2022-12-15T19:19:21.221655410Z I1215 19:19:21.221529       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"16340e00-06fa-4526-9dc4-3da772936f35", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' successfully removed member: [ url: 10.0.182.231, name: ip-10-0-182-231.ec2.internal, id: 2485787822400865737 ] from the cluster

Member added as learner at 19:24:50:

2022-12-15T19:24:50.597610435Z I1215 19:24:50.597552       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"16340e00-06fa-4526-9dc4-3da772936f35", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberAddAsLearner' successfully added new member https://10.0.152.48:2380

I'm looking back into the clustermemberremoval controller, in particular the PR (openshift/cluster-etcd-operator#947) that relaxed the scale-down constraint for unhealthy members. I think we may have inadvertently relaxed it for a healthy member too. Need to confirm.
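
To make the replicas vs. readyReplicas point concrete, here is a minimal sketch of polling a CPMS-like status until `replicas` reaches the surge count. The `cpmsStatus` struct, `waitForReplicas`, and `simulateSurge` are hypothetical stand-ins for illustration, not the helpers in scalingtestinglibrary. The simulated status values are made up to mimic the case above, where readyReplicas never reaches 4.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cpmsStatus is a hypothetical stand-in for the two status counters discussed above.
type cpmsStatus struct {
	Replicas      int32
	ReadyReplicas int32
}

// waitForReplicas polls get until status.Replicas equals want or ctx is done.
// Errors returned by get are treated as retryable until the context expires.
func waitForReplicas(ctx context.Context, want int32, interval time.Duration, get func() (cpmsStatus, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		status, err := get()
		if err == nil && status.Replicas == want {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("replicas never reached %d: %w", want, ctx.Err())
		case <-ticker.C:
		}
	}
}

// simulateSurge fakes a CPMS whose replicas surge to 4 while readyReplicas stays below 4.
func simulateSurge() error {
	calls := 0
	get := func() (cpmsStatus, error) {
		calls++
		if calls < 3 {
			return cpmsStatus{Replicas: 3, ReadyReplicas: 2}, nil
		}
		return cpmsStatus{Replicas: 4, ReadyReplicas: 3}, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return waitForReplicas(ctx, 4, 10*time.Millisecond, get)
}

func main() {
	fmt.Println("wait error:", simulateSurge())
}
```

A check on `ReadyReplicas == 4` would time out in this simulation, which is the failure mode described above.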

Contributor Author

Waiting for the fix that disallows scale-down or removal of the machine pending deletion before the new one is added: openshift/cluster-etcd-operator#978

@hasbro17 hasbro17 force-pushed the update-scaling-test-to-use-cpms branch from 0345bcd to 610dde7 Compare January 16, 2023 21:57
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2023
@hasbro17 hasbro17 force-pushed the update-scaling-test-to-use-cpms branch from 610dde7 to 222ce3d Compare January 17, 2023 21:11
@hasbro17
Contributor Author

Looks like the node never came up for the azure run. Retesting:
/test e2e-azure-ovn-etcd-scaling

@hasbro17
Contributor Author

Okay so this finally works.

For CPMS supported platforms (AWS and Azure currently), the test relies on the CPMSO to create the machine, while we revert to the manual deletion-then-creation workflow for unsupported platforms (GCP, vSphere).
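
As a rough illustration of that branching, here is a minimal sketch. The `controlPlaneMachineSet` struct and both function names are hypothetical, and treating only a CPMS in the "Active" state as usable is an assumption made for this example, not necessarily the exact check the test performs.

```go
package main

import "fmt"

// controlPlaneMachineSet is a hypothetical, trimmed-down stand-in for the CPMS object.
type controlPlaneMachineSet struct {
	State string // assumed values: "Active" or "Inactive"
}

// useCPMSWorkflow reports whether the test should let the CPMSO replace machines.
// A nil cpms models a platform where no CPMS exists.
func useCPMSWorkflow(cpms *controlPlaneMachineSet) bool {
	return cpms != nil && cpms.State == "Active"
}

// workflowFor returns a short label for the workflow the test would pick.
func workflowFor(cpms *controlPlaneMachineSet) string {
	if useCPMSWorkflow(cpms) {
		return "cpmso"
	}
	return "manual"
}

func main() {
	fmt.Println(workflowFor(&controlPlaneMachineSet{State: "Active"}))
	fmt.Println(workflowFor(&controlPlaneMachineSet{State: "Inactive"}))
	fmt.Println(workflowFor(nil))
}
```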

The scaling test itself is passing on azure but tripping up on a familiar (albeit unresolved) issue:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27497/pull-ci-openshift-origin-master-e2e-azure-ovn-etcd-scaling/1615502526784737280


[sig-arch] events should not repeat pathologically (0s)

{  1 events happened too frequently  event happened 23 times, something is wrong: ns/openshift-authentication-operator deployment/authentication-operator - reason/OpenShiftAPICheckFailed "user.openshift.io.v1" failed with an attempt failed with statusCode = 503, err = the server is currently unable to handle the request result=reject }

This should be good to review now.

@hasbro17
Contributor Author

Retesting but will eventually override the azure test if it keeps failing on the known pathological events case.

/test e2e-azure-ovn-etcd-scaling
/retest-required
/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 18, 2023
@hasbro17
Contributor Author

/retest-required


if cpmsSupported {

// TODO: Add cleanup step to recover back to 3 running machines and members if the test fails
Contributor

do we still have serial jobs where this could become an issue?

Contributor Author

Not currently, no. We may have other scaling-related tests that have to run serially in the future, but I'd prefer to leave that as a TODO for when we actually have those.

err = errors.Wrap(err, "pre-test: failed to determine if ControlPlaneMachineSet is present")
o.Expect(err).ToNot(o.HaveOccurred())

if cpmsSupported {
Contributor

do you think we can move that whole if code block into a dedicated test function? have the setup code above as an initializer for both

Contributor Author

I could move the setup code inside BeforeEach() so we do the common setup before each It() spec block.

And I would have two It() spec blocks in that case which would both always run except one would always get skipped based on the conditional. Just to illustrate what I'm thinking:

g.Describe("[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling] etcd [apigroup:config.openshift.io]", func() {

	var cpmsActive bool
	// Plus all the other common setup variables to pass to the It() closures below

	g.BeforeEach(func() {
		// Common setup and compute cpmsActive.
		// Assign with "=" so the outer variable is set rather than shadowed.
		cpmsActive = ...
	})


	g.It("is able to vertically scale up and down with a single node when CPMS is active [Timeout:60m][apigroup:machine.openshift.io]", func() {
		if !cpmsActive {
			g.Skip("CPMS is inactive so cannot use CPMS")
		}

		// New CPMS based workflow

	})


	g.It("is able to vertically scale up and down with a single node when CPMS is inactive [Timeout:60m][apigroup:machine.openshift.io]", func() {
		if cpmsActive {
			g.Skip("CPMS is active so cannot manually create machines")
		}

		// Old workflow
	})

})

Grammatically this is a bit awkward, and I'm not sure the conditional is very clear here.

I'm no ginkgo expert but I would think something like the following reads better with a When or Describe block which are both equivalent:

g.When("CPMS is active", func() {

	g.It("is able to vertically scale up and down with a single node when CPMS is active [Timeout:60m][apigroup:machine.openshift.io]", func() {
	...
	})
})

g.When("CPMS is inactive", func() {

	g.It("is able to vertically scale up and down with a single node when CPMS is inactive [Timeout:60m][apigroup:machine.openshift.io]", func() {
	...
	})
})

But again I'm not sure how to conditionally run either When() block here so that's why I used It() in the former example.
But let me know if that's what you meant.

Contributor

I was thinking along the lines of

scalingtestinglibrary.SkipIfUnsupportedPlatform(ctx, oc)

so your first description fits the best. I just want to save some refactoring steps for ETCD-330, where I would also need to detect whether CPMS is active or not.

The g.When() indeed reads better, but if it's taking too much time for you to figure it out you can leave this for me :)

Contributor Author

Ah okay I get the context for this now. I'm still not liking the former layout of skipping one of the two It() blocks and haven't reviewed https://github.com/openshift/origin/pull/27461/files enough to see if we can't make that an entirely separate test case for that.

The quorum checker test would even need to disable the CPMS as well (when active) so we would have to serialize or order that so it can clean up and not be disruptive.

In the interest of landing a smaller fix temporarily (to improve pass rates and unblock other debugging efforts), I would lean towards punting the refactor for when we actually have a second test case. But not to say it's your problem :) I will refactor the test to make it more Ginkgo-readable as a follow up.

This would be easier to backport to 4.12 as well in case we have to refactor again in ETCD-330. WDYT?

Contributor

and haven't reviewed https://github.com/openshift/origin/pull/27461/files enough to see if we can't make that an entirely separate test case for that.

save yourself the time, I'll likely write this again from scratch.

The quorum checker test would even need to disable the CPMS as well (when active) so we would have to serialize or order that so it can clean up and not be disruptive.

I personally would just have the checker test run when there is no CPMS.

I would lean towards punting the refactor for when we actually have a second test case. But not to say it's your problem :) I will refactor the test to make it more Ginkgo-readable as a follow up.

All good, I'm fine with refactoring this in the other ticket. Let's get the pass rates fixed first before we complicate.

}

// No need to be deterministic since it doesn't matter which machine we delete
machineToDelete = machineList.Items[0].Name
Contributor

there might be two invocations of the wait.Poll, especially if the delete failed on some transient client/response error. So we could end up with two deleted masters here.

I would just remove the wait.Poll altogether and rely on the machineClient retries on a single chosen machine name.

Contributor Author

Good catch. I'm going to make it deterministic by always picking the master name with the lowest index, since all machines are suffixed, e.g. ...-master-0, ...-master-1, etc.

remove the wait.Poll altogether and rely on the machineClient retries

The wait.Poll is for retrying, so I'm not sure if you meant replacing it with another util? The machineClient won't auto-retry. Or do you mean just looping the Delete until it goes through? That would need some backoff in case of transient API errors, which is what the wait.Poll does.
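
For illustration, the deterministic pick could look roughly like the sketch below. `lowestIndexMaster` is a hypothetical helper, and it assumes every candidate name ends in `-master-<index>` as described above; names that do not match are skipped.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lowestIndexMaster returns the name with the smallest numeric suffix after "-master-".
// Names without a plain non-negative index after the last "-master-" are ignored.
func lowestIndexMaster(names []string) (string, error) {
	const marker = "-master-"
	best, bestIdx := "", -1
	for _, name := range names {
		pos := strings.LastIndex(name, marker)
		if pos < 0 {
			continue
		}
		idx, err := strconv.Atoi(name[pos+len(marker):])
		if err != nil || idx < 0 {
			continue // suffix is not a plain index, skip it
		}
		if bestIdx == -1 || idx < bestIdx {
			best, bestIdx = name, idx
		}
	}
	if bestIdx == -1 {
		return "", fmt.Errorf("no machine name ends in %s<index>", marker)
	}
	return best, nil
}

func main() {
	names := []string{"ci-abc-master-2", "ci-abc-master-0", "ci-abc-master-1"}
	name, err := lowestIndexMaster(names)
	fmt.Println(name, err)
}
```

Comparing the parsed index rather than sorting the strings keeps the pick stable even if an index ever has more than one digit.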

Contributor

ahh, thank you. I thought that there's some retry utility baked into the clients already. As long as you pick the master deterministically it's fine :)

t.Logf("attempting to delete machine %q", machineToDelete)

if err := machineClient.Delete(ctx, machineToDelete, metav1.DeleteOptions{}); err != nil {
return isTransientAPIError(t, err)
Contributor

also regarding the polling, if it's already deleted this will error out. So I think we need another:

if apierrors.IsNotFound(err) {

For platforms where the ControlPlaneMachineSet is active and
being reconciled by the CPMSO, the vertical scaling test should rely on
the CPMSO to remove and add new machines, otherwise there is a race between
the test removing a machine and the CPMSO adding a new one.
@hasbro17 hasbro17 force-pushed the update-scaling-test-to-use-cpms branch from 222ce3d to 144c8aa Compare January 19, 2023 21:32
// The machine we just listed should be present but if not, error out
if apierrors.IsNotFound(err) {
t.Logf("machine %q was listed but not found or already deleted", machineToDelete)
return false, fmt.Errorf("machine %q was listed but not found or already deleted", machineToDelete)
Contributor

shouldn't this be:

return true, nil

after all, the expectation of the whole function is that you want to delete the machine? 🗡️

Contributor Author

Yes we want to delete the machine but having just listed the machine in L93 and then subsequently finding out the machine is no longer present means something has gone wrong.

Deleting the machine doesn't mean it should just go away immediately. You need the replacement machine to be created, added, and promoted as a member, and then have member removal and deletion hook removal etc before the machine should go away.

If at this point deleting the machine results in an IsNotFound error, then the test really can't verify anything anymore in the above sequence for the vertical scaling workflow, since the machine is already gone.
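
To spell out the behavior being argued for, here is a minimal, self-contained sketch of a delete condition that retries transient errors with backoff but fails fast on not-found. The sentinel errors and the `deleteOnce`, `deleteWithRetry`, and `simulate` helpers are hypothetical stand-ins, not the code in this PR or the apimachinery error helpers.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for API error classes.
var (
	errNotFound  = errors.New("not found")
	errTransient = errors.New("transient API error")
)

// deleteOnce has the shape of a poll condition: it returns (done, err).
func deleteOnce(name string, del func(string) error) (bool, error) {
	err := del(name)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errNotFound):
		// The machine was just listed, so not-found means something went wrong.
		return false, fmt.Errorf("machine %q was listed but not found or already deleted", name)
	case errors.Is(err, errTransient):
		return false, nil // not done yet, retry
	default:
		return false, err
	}
}

// deleteWithRetry calls deleteOnce up to attempts times, doubling the backoff each time.
func deleteWithRetry(name string, attempts int, backoff time.Duration, del func(string) error) error {
	for i := 0; i < attempts; i++ {
		done, err := deleteOnce(name, del)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("gave up deleting machine %q after %d attempts", name, attempts)
}

// simulate replays the given errors, one per delete call, repeating the last one.
// It expects at least one entry in errs.
func simulate(errs ...error) error {
	i := 0
	del := func(string) error {
		err := errs[i]
		if i < len(errs)-1 {
			i++
		}
		return err
	}
	return deleteWithRetry("ci-abc-master-0", 4, time.Millisecond, del)
}

func main() {
	fmt.Println(simulate(errTransient, nil)) // transient error is retried, then the delete succeeds
	fmt.Println(simulate(errNotFound))       // not-found aborts immediately with an error
}
```

Returning `true, nil` on not-found would instead make the second call report success, which hides the case where the machine vanished before the rest of the workflow could be verified.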

@hasbro17
Contributor Author

/retest-required
/test e2e-azure-ovn-etcd-scaling
/test e2e-vsphere-ovn-etcd-scaling

Test is fine on vSphere (just the known disruption test failures again). Azure has passed with the new workflow before, and the current failure looks to be an infra problem.

@dusk125
Contributor

dusk125 commented Jan 25, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jan 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 46433a0 and 2 for PR HEAD 144c8aa in total

@hasbro17
Contributor Author

Scaling jobs look good. vSphere has passed before, and Azure has the test passing but is seeing other failures:

Thanos queriers not connected to all Prometheus sidecars: server_error: server error: 504

/retest-required

@tjungblu
Contributor

/retest

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 64ba42b and 1 for PR HEAD 144c8aa in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 75cc70b and 0 for PR HEAD 144c8aa in total

@openshift-ci
Contributor

openshift-ci bot commented Jan 27, 2023

@hasbro17: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
-- | -- | -- | -- | --
ci/prow/e2e-gcp-ovn-builds | 8a609d9 | link | true | /test e2e-gcp-ovn-builds
ci/prow/e2e-gcp-ovn-image-ecosystem | 8a609d9 | link | true | /test e2e-gcp-ovn-image-ecosystem
ci/prow/e2e-aws-ovn-image-registry | 8a609d9 | link | true | /test e2e-aws-ovn-image-registry
ci/prow/e2e-vsphere-ovn-etcd-scaling | 144c8aa | link | false | /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-openstack-ovn | 144c8aa | link | false | /test e2e-openstack-ovn
ci/prow/e2e-azure-ovn-etcd-scaling | 144c8aa | link | false | /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-aws-ovn-single-node | 144c8aa | link | false | /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-single-node-upgrade | 144c8aa | link | false | /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.


@openshift-merge-robot openshift-merge-robot merged commit 7e728ca into openshift:master Jan 27, 2023
@hasbro17
Contributor Author

hasbro17 commented Jan 27, 2023

Needs a backport to 4.12 as well.

/cherry-pick release-4.12

Although I forgot to classify this as a 4.13 bug and now need a 4.13 verified and a 4.12 bug to depend on that now ☹️

@openshift-cherrypick-robot

@hasbro17: new pull request created: #27692


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. vendor-update Touching vendor dir or related files

6 participants