Bug 1856270: allow users to manually delete machines stuck in crash loop #111

Closed
iamemilio wants to merge 1 commit into openshift:master from iamemilio:remove_delete_loop_finalizer
@iamemilio

Machines that fail to provision and cannot be deleted cause CAPO to get stuck in a very specific crash loop. We want to allow users to manually delete the machine when this happens.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 17, 2020
@openshift-ci-robot

@iamemilio: This pull request references Bugzilla bug 1856270, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
In response to this:

Bug 1856270: allow users to manually delete machines stuck in crash loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 17, 2020

@elmiko elmiko left a comment

i understand what is happening in this change, but i am curious what happens if a machine has actually been deleted: will we still try to delete the finalizer through this logic?

and if yes, is there a consequence of trying to update something that has been deleted?

@iamemilio
Author

I don't think we can hit this code block if the machine has been deleted, because before we get to this point we check whether the machine exists multiple times (I'm not sure why it's that redundant) and exit the function if it does not. This should only trigger if we check that the machine exists, and it does, then try to delete it and get a "resource not found" error. I think there must be a bug in OpenStack that causes this case to occur.

@iamemilio
Author

That being said, I have been unable to create a runnable release image for almost a week, and so I cannot really verify any of this.

@elmiko

elmiko commented Aug 17, 2020

ack, thanks for the explanation @iamemilio, i can see that this code won't be reached unless the machine is not deleted. it makes sense now.

/lgtm

@openshift-ci-robot

@elmiko: changing LGTM is restricted to collaborators

In response to this:

ack, thanks for the explanation @iamemilio, i can see that this code won't be reached unless the machine is not deleted. it makes sense now.

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, iamemilio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@michaelgugino michaelgugino left a comment

/hold

This patch is not the right approach in a variety of ways.

If we remove the finalizer, there will be no indication that there was a problem; the machine will silently go away. There will be nothing left to tell the user to go manually delete an instance from the cloud.

A 404 on delete needs to be handled by the actuator. General practice is to check whether the instance exists in the actuator.Delete() call before attempting to delete the instance, and to handle errors appropriately there.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2020
@elmiko

elmiko commented Aug 17, 2020

solid advice, thanks Mike!

@mandre
Member

mandre commented Aug 18, 2020

/hold

This patch is not the right approach in a variety of ways.

Not mentioning it's modifying the vendored MAO directly...

Thanks folks for the feedback and pointers to how GCP handles the same issue.

@elmiko

elmiko commented Aug 18, 2020

Not mentioning it's modifying the vendored MAO directly...

ouch. can't believe i missed that =(

@iamemilio
Author

iamemilio commented Aug 18, 2020

Not mentioning it's modifying the vendored MAO directly...

   ouch. can't believe i missed that =(

I can't believe I didn't notice that either haha

@iamemilio
Author

@michaelgugino the issue with this bug is that the machine will show up when we check that it exists, but it will return "404 resource not found" when we try to delete it.

@iamemilio
Author

Considering your comment on the bugzilla, this is not a viable solution. Closing

@iamemilio iamemilio closed this Aug 18, 2020
@openshift-ci-robot

@iamemilio: This pull request references Bugzilla bug 1856270. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1856270: allow users to manually delete machines stuck in crash loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pierreprinetti pushed a commit to shiftstack/cluster-api-provider-openstack that referenced this pull request Apr 22, 2024
Switch yaml package from ghodss repo to sigs.k8s.io fork kubernetes-sigs#572