ETCD-336: add e2e replace unhealthy master machine #27496
Conversation
/hold
@hasbro17 would you please check if the current procedure is correct? IIUC I need to fail a machine and expect a replacement to be created automatically after removing the unhealthy member from the etcd cluster. Please correct me if I am wrong. Also, we need to reuse the setup and cleanup in BeforeEach/AfterEach to remove setup duplication.
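A minimal sketch of that dedup, assuming the Ginkgo layout openshift/origin e2e tests use; the import paths follow that layout, while skipIfUnsupportedPlatform and ensureHealthyMemberCount are hypothetical helpers standing in for the shared setup/cleanup:

package etcd

import (
	g "github.com/onsi/ginkgo/v2"

	exutil "github.com/openshift/origin/test/extended/util"
)

var _ = g.Describe("[sig-etcd] etcd scaling", func() {
	oc := exutil.NewCLIWithoutNamespace("etcd-scaling")

	g.BeforeEach(func() {
		// Shared setup for every scaling test, e.g. platform checks.
		skipIfUnsupportedPlatform(oc) // hypothetical helper
	})

	g.AfterEach(func() {
		// Shared cleanup, e.g. verify the member count is restored.
		ensureHealthyMemberCount(oc, 3) // hypothetical helper
	})
})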
/assign @hasbro17
Force-pushed from 4356b37 to c165250.
{
	name:     "basic",
	testLine: "ok package/name 0.160s",
this was generated by running `make update-gofmt`
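For context, gofmt aligns the values of adjacent keyed fields in a composite literal, so a whitespace-only hunk like the one above is exactly what `make update-gofmt` produces. A small illustration (the testCase type is made up for the example):

type testCase struct {
	name     string
	testLine string
}

// Before formatting, the longer second key pushed the values out of
// alignment; after gofmt, the values line up in one column:
var tc = testCase{
	name:     "basic",
	testLine: "ok package/name 0.160s",
}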
Force-pushed from 0e75e01 to c33dffa.
/retest-required
/hold cancel
/retest
Shall we override the failing single-node jobs?
The e2e scenario needs a highly available control-plane setup. @hasbro17 would you kindly override the singleNode jobs?
//machineIndexToFail := rand.Intn(len(machineList.Items))
machineToFail = &machineList.Items[0]
// update victim machine's status
machineToFail.Status.NodeRef = nil
Does NodeRef need to be nil? What did you see when you didn't set it to nil?
I am trying to make the node failed/unhealthy. Isn't setting this to nil needed?
I don't think so. If you simply set the phase to Failed, from a Machine API perspective that's considered unhealthy. It might be needed for your logic though, where you are trying to simulate that the Node is gone.
In this case, do you need to somehow break the etcd pod to make sure you have an unhealthy member?
Yeah, it would be great to make it unhealthy first and then have the node deleted. Any suggestion on how to make the member unhealthy and the node failed?
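For what it's worth, a minimal sketch of one way to simulate the failure, assuming machineClient is a typed Machines client (e.g. MachineV1beta1().Machines("openshift-machine-api") from openshift/client-go) with permission to write the status subresource; the machine controller may reconcile this back, so treat it as an assumption rather than the PR's actual approach:

// Simulate "the node is gone" for the victim machine. Whether this alone
// makes the etcd member unhealthy is the open question in this thread.
machine, err := machineClient.Get(ctx, machineName, metav1.GetOptions{})
o.Expect(err).ToNot(o.HaveOccurred())

machine.Status.NodeRef = nil // clear the link to the backing Node
_, err = machineClient.UpdateStatus(ctx, machine, metav1.UpdateOptions{})
o.Expect(err).ToNot(o.HaveOccurred())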
// The following test covers replacing a failed master machine in a three-master cluster.
// It starts by setting one of the master machines to failed/unhealthy.
// The failed machine is expected to be replaced automatically.
Users must trigger this though.
yes, that's why I make this API call:

err = machineClient.Delete(ctx, machineName, metav1.DeleteOptions{})
o.Expect(err).ToNot(o.HaveOccurred())
framework.Logf("successfully deleted the machine %q from the API", machineName)
Ok cool, may just want to update this comment
sure, I am just trying to get `make verify` to pass :D
updated :)
// make sure it can be run on the current platform
// TODO: shall we refactor this into `JustBeforeEach()`?
scalingtestinglibrary.SkipIfUnsupportedPlatform(ctx, oc)
This will only work today on AWS, 4.12 onwards; is this accounting for that? In later releases we may add additional automated setup, e.g. for Azure and GCP.
I think so, isn't it, @hasbro17?
I added OpenStack and GCP to the skipped jobs @JoelSpeed
No it won't. You need to detect whether the platform supports the cluster CPMS and then skip. Responded in #27496 (comment)
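A sketch of the detection Joel describes: check for the singleton ControlPlaneMachineSet instead of hard-coding a platform list. The GVR and the "cluster"/"openshift-machine-api" names come from the machine.openshift.io/v1 API; the function name and wiring are assumptions, and a stricter check might also require .spec.state to be Active:

package etcd

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	e2eskipper "k8s.io/kubernetes/test/e2e/framework/skipper"
)

// skipUnlessCPMS skips the test when the cluster exposes no
// ControlPlaneMachineSet, rather than enumerating supported platforms.
func skipUnlessCPMS(ctx context.Context, client dynamic.Interface) {
	cpmsGVR := schema.GroupVersionResource{
		Group:    "machine.openshift.io",
		Version:  "v1",
		Resource: "controlplanemachinesets",
	}
	_, err := client.Resource(cpmsGVR).Namespace("openshift-machine-api").
		Get(ctx, "cluster", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		e2eskipper.Skipf("no ControlPlaneMachineSet on this platform, skipping")
	}
}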
Force-pushed from c33dffa to bf5d575.
/label tide/merge-method-squash
Force-pushed from d5f7c77 to 884942a.
@JoelSpeed @hasbro17 Yesterday the CI succeeded, yet checking the logs I could not identify anything that proves a replacement master machine was created. Would appreciate it if you could have a look as well.
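One way to make the replacement provable from the CI logs would be to poll for it explicitly instead of inferring it from overall job success. A sketch in the same style as the snippets above, assuming the same machineClient, wait from k8s.io/apimachinery/pkg/util/wait, and the standard Machine API role label; the 3-machine target and timeouts are illustrative:

// Poll until the expected number of master machines is Running again,
// logging each observation so the replacement shows up in the artifacts.
err := wait.Poll(10*time.Second, 15*time.Minute, func() (bool, error) {
	machines, err := machineClient.List(ctx, metav1.ListOptions{
		LabelSelector: "machine.openshift.io/cluster-api-machine-role=master",
	})
	if err != nil {
		return false, err
	}
	running := 0
	for _, m := range machines.Items {
		if m.Status.Phase != nil && *m.Status.Phase == "Running" {
			running++
		}
	}
	framework.Logf("found %d running master machines", running)
	return running == 3, nil
})
o.Expect(err).ToNot(o.HaveOccurred())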
/retest-required
1 similar comment
/retest-required
Force-pushed from e59727b to 2b56607.
@Elbehery: The following tests failed. Full PR test history. Your PR dashboard.
@Elbehery: PR needs rebase.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.

/lifecycle stale
@hasbro17 do we still need this?
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close.

/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen.

/close
@openshift-bot: Closed this PR.
This PR adds an e2e workflow for the automated replacement of an unhealthy master machine.
cc @hasbro17 @tjungblu @JoelSpeed