Skip to content

Conversation

@Elbehery
Copy link
Contributor

@Elbehery Elbehery commented Oct 25, 2022

This PR adds a e2e workflow for the automated replacement of unhealthy master machine

cc @hasbro17 @tjungblu @JoelSpeed

  • use BeforeEach, AfterEach to remove setup duplication
  • add logs and clear error msgs
  • rename vars and funcs
  • refactor provider client to be agnostic
  • wait after killing instance, till machine being failed, before deletion from api-server

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 25, 2022
@Elbehery
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 25, 2022
@openshift-ci openshift-ci bot requested review from csrwng and hasbro17 October 25, 2022 14:17
@Elbehery
Copy link
Contributor Author

Elbehery commented Oct 25, 2022

@hasbro17 would you please check if the current procedure is correct ? ..

iiuc I need to fail a machine and expect a replacement to be created automatically after removing the unhealthy member from etcd cluster.. please correct me if i am wrong

Also we need to reuse the setup and cleanup in BeforeEach and AfterEach iiuc

@Elbehery
Copy link
Contributor Author

/assign @hasbro17
/assign @JoelSpeed

@Elbehery Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch from 4356b37 to c165250 Compare October 25, 2022 15:59
{
name: "basic",
testLine: "ok package/name 0.160s",
name: "basic",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was generated by running

make update-gofmt

@Elbehery Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch 2 times, most recently from 0e75e01 to c33dffa Compare October 26, 2022 01:32
@Elbehery
Copy link
Contributor Author

/retest-required

@Elbehery
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2022
@Elbehery
Copy link
Contributor Author

/retest

@Elbehery
Copy link
Contributor Author

shall we override single node failing ?

@Elbehery
Copy link
Contributor Author

the e2e scenario needs highly available control plane setup. @hasbro17 would you kindly override the singleNode jobs ?

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2022
//machineIndexToFail := rand.Intn(len(machineList.Items))
machineToFail = &machineList.Items[0]
// update victim machine's status
machineToFail.Status.NodeRef = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does NodeRef need to be nil? What did you see when you didn't set this nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to make the node failed/unhealthy .. Isn't setting this to nil needed ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, if you simply set the phase to Failed, from a Machine API perspective that's considered to be unhealthy. It might be needed for your logic though, where you are trying to simulate that the Node is gone.

In this case, do you need to somehow break the etcd pod to make sure you have an unhealthy member?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that would be gr8 to make it unhealthy first, then the node is being deleted .. Any suggestion how to make the member unhealthy and the node failed ?


// The following test covers replacing a failed master machine from 3 master nodes cluster.
// It starts by setting one of the master machine to be failed/unhealthy.
// The failed machine is expected to be replaced automatically.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User's must trigger this though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thats why I make this API call

                 err = machineClient.Delete(ctx, machineName, metav1.DeleteOptions{})
		o.Expect(err).ToNot(o.HaveOccurred())
		framework.Logf("successfully deleted the machine %q from the API", machineName)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok cool, may just want to update this comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i am just trying to get success from make verify :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated :)

// make sure it can be run on the current platform
// TODO: shall we refactor this into `JustBeforeEach()`
scalingtestinglibrary.SkipIfUnsupportedPlatform(ctx, oc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only work today on AWS, 4.12 onwards, is this accounting for that?

In later releases we may add additional automated set up, eg for Azure and GCP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, isn't it @hasbro17

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added OpenStack and GCP to the skipped jobs @JoelSpeed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it won't. You need to detect whether the platform supports the cluster CPMS and then skip. Responded in #27496 (comment)

@Elbehery Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch from c33dffa to bf5d575 Compare October 26, 2022 09:27
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2022
@Elbehery
Copy link
Contributor Author

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 26, 2022
@Elbehery Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch 2 times, most recently from d5f7c77 to 884942a Compare October 26, 2022 11:09
@Elbehery
Copy link
Contributor Author

@JoelSpeed @hasbro17 Yesterday the ci successded .. yet checking the logs, i could not identify anything proves that a replacement master machine has been created .. would appreciate if you have a look also

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27496/pull-ci-openshift-origin-master-e2e-aws-ovn-fips/1585082007124185088

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27496/pull-ci-openshift-origin-master-e2e-aws-csi/1585082007048687616

@Elbehery
Copy link
Contributor Author

/retest-required

1 similar comment
@Elbehery
Copy link
Contributor Author

/retest-required

@Elbehery Elbehery force-pushed the add_e2e_replace_unhealthy_master_machine branch from e59727b to 2b56607 Compare November 7, 2022 12:12
@Elbehery
Copy link
Contributor Author

Elbehery commented Nov 7, 2022

currently awaiting #27497 and #27461

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 7, 2022

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-ovn-image-ecosystem e6cda20 link true /test e2e-gcp-ovn-image-ecosystem
ci/prow/e2e-aws-ovn-cgroupsv2 2b56607 link false /test e2e-aws-ovn-cgroupsv2
ci/prow/e2e-aws-ovn-single-node-upgrade 2b56607 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-serial 2b56607 link true /test e2e-aws-ovn-serial
ci/prow/verify 2b56607 link true /test verify
ci/prow/e2e-aws-ovn-single-node-serial 2b56607 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-openstack-ovn 2b56607 link false /test e2e-openstack-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 10, 2022
@openshift-merge-robot
Copy link
Contributor

@Elbehery: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2023
@Elbehery
Copy link
Contributor Author

Elbehery commented Feb 9, 2023

@hasbro17 do we still need this ?

@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 11, 2023
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Apr 11, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 11, 2023

@openshift-bot: Closed this PR.

Details

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Elbehery Elbehery deleted the add_e2e_replace_unhealthy_master_machine branch September 30, 2024 03:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants