ETCD-336: add e2e replace unhealthy master machine #27496

Elbehery · 2022-10-25T14:16:28Z

This PR adds an e2e workflow for the automated replacement of an unhealthy master machine.

use BeforeEach and AfterEach to remove setup duplication
add logs and clear error messages
rename vars and funcs
refactor the provider client to be agnostic
after killing the instance, wait until the machine is marked Failed before deleting it from the API server
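
The last item (wait for the machine to fail before deleting it) could look roughly like the sketch below. This is only an illustration, not the code in this PR: `phaseGetter`, `waitForPhase`, and `runDemo` are hypothetical names, and the getter is a fake standing in for a real Machine API client call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// phaseGetter is a hypothetical stand-in for a client call that returns the
// current phase of the named machine.
type phaseGetter func(ctx context.Context, name string) (string, error)

// waitForPhase polls get until the machine reports the wanted phase, the
// getter returns an error, or ctx is done.
func waitForPhase(ctx context.Context, get phaseGetter, name, want string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		phase, err := get(ctx, name)
		if err != nil {
			return fmt.Errorf("getting phase of machine %q: %w", name, err)
		}
		if phase == want {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("machine %q did not reach phase %q: %w", name, want, ctx.Err())
		case <-ticker.C:
			// Poll again on the next tick.
		}
	}
}

// runDemo drives waitForPhase against a fake getter that reports "Running"
// until its failAfter-th call and "Failed" from then on. It returns how many
// times the getter was polled.
func runDemo(failAfter, timeoutMillis int) (int, error) {
	calls := 0
	get := func(context.Context, string) (string, error) {
		calls++
		if calls < failAfter {
			return "Running", nil
		}
		return "Failed", nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	err := waitForPhase(ctx, get, "master-0", "Failed", 5*time.Millisecond)
	return calls, err
}

func main() {
	calls, err := runDemo(3, 1000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("machine reached Failed after %d polls; it can now be deleted from the API\n", calls)
}
```

In the real test the getter would wrap a call to the machine client, and the timeout would need to be long enough for the provider to notice that the instance is gone.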

Elbehery · 2022-10-25T14:16:33Z

/hold

Elbehery · 2022-10-25T14:17:35Z

@hasbro17 would you please check whether the current procedure is correct?

iiuc, I need to fail a machine and expect a replacement to be created automatically after the unhealthy member is removed from the etcd cluster. Please correct me if I am wrong.

Also, iiuc, we need to reuse the setup and cleanup in BeforeEach and AfterEach.

Elbehery · 2022-10-25T14:29:20Z

/assign @hasbro17
/assign @JoelSpeed

Elbehery · 2022-10-25T16:01:17Z

tools/junitreport/pkg/parser/gotest/data_parser_test.go

 		{
-			name: "basic",
-			testLine: "ok  	package/name 0.160s",
+			name:         "basic",


this was generated by running

make update-gofmt

Elbehery · 2022-10-26T04:00:46Z

/retest-required

Elbehery · 2022-10-26T04:01:01Z

/hold cancel

Elbehery · 2022-10-26T04:14:04Z

/retest

Elbehery · 2022-10-26T04:20:29Z

Shall we override the failing single-node jobs?

Elbehery · 2022-10-26T04:38:58Z

The e2e scenario needs a highly available control plane setup. @hasbro17 would you kindly override the singleNode jobs?

JoelSpeed · 2022-10-26T08:23:11Z

test/extended/etcd/helpers/helpers.go

+	//machineIndexToFail := rand.Intn(len(machineList.Items))
+	machineToFail = &machineList.Items[0]
+	// update victim machine's status
+	machineToFail.Status.NodeRef = nil


Does NodeRef need to be nil? What did you see when you didn't set this nil?

I am trying to make the node failed/unhealthy. Isn't setting this to nil needed?

I don't think so. If you simply set the phase to Failed, that's considered unhealthy from a Machine API perspective. It might be needed for your logic though, where you are trying to simulate that the Node is gone.

In this case, do you need to somehow break the etcd pod to make sure you have an unhealthy member?

Yeah, it would be great to make the member unhealthy first and then have the node deleted. Any suggestions on how to make the member unhealthy and the node failed?
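
To make the distinction in this thread concrete, here is a small sketch that separates "the machine is Failed" from "the machine has no Node". `machineStatus` is a hypothetical, trimmed-down stand-in for the real Machine status type, and the two helper names are made up for illustration.

```go
package main

import "fmt"

// machineStatus is a hypothetical stand-in for a Machine's status; only the
// two fields discussed in this thread are modelled.
type machineStatus struct {
	Phase   string  // e.g. "Running" or "Failed"
	NodeRef *string // name of the backing Node, nil when there is none
}

// isFailed reports whether the machine counts as unhealthy based on its
// phase alone, independent of whether a Node is still referenced.
func isFailed(s machineStatus) bool {
	return s.Phase == "Failed"
}

// nodeGone reports whether the machine no longer references a Node. This is
// a separate condition from the phase being Failed.
func nodeGone(s machineStatus) bool {
	return s.NodeRef == nil
}

func main() {
	node := "ip-10-0-0-1"
	withNode := machineStatus{Phase: "Failed", NodeRef: &node}
	withoutNode := machineStatus{Phase: "Failed"}

	fmt.Println(isFailed(withNode), nodeGone(withNode))       // prints: true false
	fmt.Println(isFailed(withoutNode), nodeGone(withoutNode)) // prints: true true
}
```

Neither check says anything about the health of the etcd member itself, which is the open question above.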

JoelSpeed · 2022-10-26T08:23:39Z

test/extended/etcd/vertical_scaling.go

+
+	// The following test covers replacing a failed master machine from 3 master nodes cluster.
+	// It starts by setting one of the master machine to be failed/unhealthy.
+	// The failed machine is expected to be replaced automatically.


Users must trigger this, though.

Yes, that's why I make this API call:

err = machineClient.Delete(ctx, machineName, metav1.DeleteOptions{})
o.Expect(err).ToNot(o.HaveOccurred())
framework.Logf("successfully deleted the machine %q from the API", machineName)

OK cool, you may just want to update this comment.

Sure, I am just trying to get make verify to pass :D

JoelSpeed · 2022-10-26T08:25:01Z

test/extended/etcd/vertical_scaling.go

+		// make sure it can be run on the current platform
+		// TODO: shall we refactor this into `JustBeforeEach()`
+		scalingtestinglibrary.SkipIfUnsupportedPlatform(ctx, oc)


Today this will only work on AWS, from 4.12 onwards. Does this account for that?

In later releases we may add additional automated setup, e.g. for Azure and GCP.

I think so. Is that right, @hasbro17?

I added OpenStack and GCP to the skipped jobs @JoelSpeed

No it won't. You need to detect whether the platform supports the cluster CPMS and then skip. Responded in #27496 (comment)
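
One way to read this suggestion is to decide the skip from whether the CPMS is present on the cluster, not from a platform allowlist. The sketch below assumes that reading. `cpmsLookup`, `errNotFound`, and `skipReason` are hypothetical names, and the lookup is a placeholder for whatever client call the test would actually use.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel meaning "the resource does not exist".
var errNotFound = errors.New("not found")

// cpmsLookup is a hypothetical stand-in for fetching the cluster's control
// plane machine set. It returns errNotFound when the resource is absent.
type cpmsLookup func() error

// skipReason returns a non-empty reason when the test should be skipped, an
// empty reason when it should run, and an error when the lookup failed for
// any reason other than the resource being absent.
func skipReason(lookup cpmsLookup) (string, error) {
	err := lookup()
	switch {
	case err == nil:
		return "", nil
	case errors.Is(err, errNotFound):
		return "control plane machine set is not present on this cluster", nil
	default:
		return "", fmt.Errorf("checking for control plane machine set: %w", err)
	}
}

func main() {
	present := func() error { return nil }
	absent := func() error { return errNotFound }

	for name, lookup := range map[string]cpmsLookup{"present": present, "absent": absent} {
		reason, err := skipReason(lookup)
		fmt.Printf("%s: skip=%t err=%v\n", name, reason != "", err)
	}
}
```

Keeping unexpected lookup errors separate from "not found" means an API outage fails the test rather than skipping it quietly.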

Elbehery · 2022-10-26T10:50:16Z

/label tide/merge-method-squash

Elbehery · 2022-10-26T11:24:24Z

@JoelSpeed @hasbro17 Yesterday the CI succeeded, yet checking the logs, I could not identify anything that proves a replacement master machine has been created. I would appreciate it if you could have a look as well.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27496/pull-ci-openshift-origin-master-e2e-aws-ovn-fips/1585082007124185088

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27496/pull-ci-openshift-origin-master-e2e-aws-csi/1585082007048687616
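
A possible way to make the replacement visible in the logs is to record the master machine names before the failure and compare them with the names afterwards. The sketch below is an assumption about how that could be done, not something this PR already does. `replacements` is a hypothetical helper and the machine names are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// replacements returns the names present in after but not in before, sorted
// so that log output is stable.
func replacements(before, after []string) []string {
	seen := make(map[string]struct{}, len(before))
	for _, name := range before {
		seen[name] = struct{}{}
	}
	var added []string
	for _, name := range after {
		if _, ok := seen[name]; !ok {
			added = append(added, name)
		}
	}
	sort.Strings(added)
	return added
}

func main() {
	before := []string{"master-0", "master-1", "master-2"}
	after := []string{"master-1", "master-2", "master-3"}
	fmt.Println(replacements(before, after)) // prints [master-3]
}
```

The test could then assert that exactly one new name appears and log it, so a passing run shows which machine was created.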

Elbehery · 2022-10-26T13:05:11Z

/retest-required

Elbehery · 2022-10-26T13:20:49Z

/retest-required

Elbehery · 2022-11-07T13:25:35Z

currently awaiting #27497 and #27461

openshift-ci · 2022-11-07T16:02:27Z

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-gcp-ovn-image-ecosystem	`e6cda20`	link	true	`/test e2e-gcp-ovn-image-ecosystem`
ci/prow/e2e-aws-ovn-cgroupsv2	`2b56607`	link	false	`/test e2e-aws-ovn-cgroupsv2`
ci/prow/e2e-aws-ovn-single-node-upgrade	`2b56607`	link	false	`/test e2e-aws-ovn-single-node-upgrade`
ci/prow/e2e-aws-ovn-serial	`2b56607`	link	true	`/test e2e-aws-ovn-serial`
ci/prow/verify	`2b56607`	link	true	`/test verify`
ci/prow/e2e-aws-ovn-single-node-serial	`2b56607`	link	false	`/test e2e-aws-ovn-single-node-serial`
ci/prow/e2e-openstack-ovn	`2b56607`	link	false	`/test e2e-openstack-ovn`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-merge-robot · 2022-11-10T08:45:41Z

@Elbehery: PR needs rebase.

openshift-bot · 2023-02-08T09:00:57Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Elbehery · 2023-02-09T01:51:27Z

@hasbro17 do we still need this?

openshift-bot · 2023-03-11T08:30:46Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2023-04-11T00:00:40Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2023-04-11T00:01:28Z

@openshift-bot: Closed this PR.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 25, 2022

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 25, 2022

openshift-ci bot requested review from csrwng and hasbro17 October 25, 2022 14:17

openshift-ci bot assigned hasbro17 and JoelSpeed Oct 25, 2022

Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch from 4356b37 to c165250 Compare October 25, 2022 15:59

Elbehery commented Oct 25, 2022

Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch 2 times, most recently from 0e75e01 to c33dffa Compare October 26, 2022 01:32

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2022

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2022

JoelSpeed reviewed Oct 26, 2022

Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch from c33dffa to bf5d575 Compare October 26, 2022 09:27

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2022

openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 26, 2022

Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch 2 times, most recently from d5f7c77 to 884942a Compare October 26, 2022 11:09

Elbehery added 15 commits November 7, 2022 13:10

fail machine from aws provider

80ff780

tidy

8820596

increase timeout

dd01e59

extract instance-id

be7f0a8

add step assertion

200dd40

wait for failed phase

8681848

remove wait failed state

d2559e9

remove cm usage

2a7846d

go mod tidy

34050f8

go fmt

d6f01e3

recreate etcd client

30ac4d2

changes to go1.18

9ca592d

formatted imports

d42acfa

rebase

e11e63c

restore gcp openstack

2b56607

Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch from e59727b to 2b56607 Compare November 7, 2022 12:12

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 10, 2022

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 11, 2023

openshift-ci bot closed this Apr 11, 2023

Elbehery deleted the add_e2e_replace_unhealthy_master_machine branch September 30, 2024 03:47
