[full ci] Prevent vic-machine from deleting non cVMs #6679
andrewtchin wants to merge 3 commits into vmware:master
Conversation
Force-pushed from 47c8440 to 72fd963
lib/install/management/delete.go (Outdated)
```go
}
if d.parentResourcepool != nil {
	if d.parentResourcepool.Reference() == defaultrp.Reference() {
		return fmt.Errorf("Target VCH is in cluster's default resource pool. Refusing to delete it.")
```
Why would we not remove if it's in the cluster default? We won't install there, so it's been moved there...why not remove it?
I thought this was a requirement, but that makes sense to me. I removed this
Well..hell...looking at the issue I see why you'd think that was a requirement..I think the first item should have said "vic-machine create will fail if the target rp is the default rp"...was really thinking of adding a new validation rule. I'm not sure that's really needed...
I think we prevent the install into a cluster default RP by creating an RP on each deploy. Previously there was an issue that would (in some cases) result in a VCH deployed to the default RP, but I think since 1.2.1 that's fixed...
yea you're right 👍
lib/install/management/delete.go (Outdated)
```go
	return err
}
// Assume the cVMs and RP have already been deleted
log.Warnf("Proceeding with delete of VCH due to --force")
```
Will there be later feedback / logging that no containerVMs have been removed? If not should that be stated here? i.e. "No container VMs found, but proceeding with delete of VCH due to use of --force"
IMO there's a possibility that the containerVMs exist elsewhere, so they would be "orphaned". Just want to make sure the user understands that no other VMs will be removed...
Good point. Added this
```go
	err = errors.Errorf("Failed to fetch guest info of appliance vm: %s", err)
	return false, err
}
extraconfig.Decode(extraconfig.MapSource(info), &cspec)
```
This requires a call to migrate to be version tolerant. While we've not updated the id field yet we should code defensively.
Alternative is that this function is changed to check for the presence of the version field itself: https://github.com/vmware/vic/blob/master/lib/config/executor/container_vm.go#L206
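A minimal sketch of that alternative: decide whether a VM is a containerVM by checking for the presence of the version field in its extraConfig, rather than decoding the full spec. The key name `guestinfo.vice./init/version` is an assumption for illustration only; the real field is defined in lib/config/executor/container_vm.go.

```go
package main

import "fmt"

// hasVersionField is a hypothetical, version-tolerant stand-in for
// isContainerVM: it only tests that the (assumed) version key exists and is
// non-empty, so it keeps working even if the id field layout changes.
func hasVersionField(extraConfig map[string]string) bool {
	v, ok := extraConfig["guestinfo.vice./init/version"]
	return ok && v != ""
}

func main() {
	cvm := map[string]string{"guestinfo.vice./init/version": "v1.2.1"}
	plainVM := map[string]string{}
	fmt.Println(hasVersionField(cvm))     // a cVM carries the field
	fmt.Println(hasVersionField(plainVM)) // an unrelated VM does not
}
```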
```go
	log.Warnf("No container VMs found, but proceeding with delete of VCH due to --force")
	err = nil
}
if d.parentResourcepool != nil {
```
Why is this check here instead of at point of use in DeleteVCHInstances where it was before?
this is required since I use d.parentResourcepool above and that could be nil if the parent RP was deleted OOB
```go
// if container delete failed, do not remove anything else
log.Infof("Specify --force to force delete")
return err
d.parentResourcepool, err = d.getComputeResource(vmm, conf)
```
I'm not sure what adding this check here gains us - the reported error condition was having had the appliance moved without updating the ComputeResource reference in the configuration, therefore we ended up with d.parentResourcePool == the cluster - this change will not prevent that.
DeleteVCHInstances would return an error in its old form if no parent resource pool could be found.
This and the next change below were to allow a VCH that has been moved to another RP to be deleted when identified by moid (--id), in the case where the original RP the VCH was created in no longer exists. Previously parentResourcepool would have returned an error in that case, causing this to bail and not delete the VCH even though it was found by its moid.
d.parentResourcePool can have the following values:
a. moid of originally targeted resource pool (endpointVM may not still be in that pool)
b. current parent pool of the endpointVM
In either case we now delete children of that pool that can be identified as cVMs, then attempt to destroy that pool if it's empty.
I'm unsure what happens if the endpointVM is the last VM in the cluster and the pool returned is the root pool for the cluster. I do not know what happens if you call destroy on the root resource pool of a cluster, but to be safe I'd rather we never do so.
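The failsafe being asked for can be sketched as a simple guard: never call destroy on a pool that is (or might be) the cluster's root pool. Pool references are modeled as plain strings here purely for illustration; in vic-machine this would compare govmomi ManagedObjectReferences, and the function name is hypothetical.

```go
package main

import "fmt"

// safeToDestroyPool refuses to destroy the pool when it is unknown or when
// it is the root pool of the cluster/host - the "to be safe I'd rather we
// never do so" case from the discussion above.
func safeToDestroyPool(pool, clusterRootPool string) bool {
	if pool == "" || pool == clusterRootPool {
		return false // never destroy the root pool, or a pool we never found
	}
	return true
}

func main() {
	fmt.Println(safeToDestroyPool("resgroup-101", "resgroup-8")) // dedicated VCH pool: ok
	fmt.Println(safeToDestroyPool("resgroup-8", "resgroup-8"))   // endpointVM in root pool: refuse
}
```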
| if ok { | ||
| // child is vch; detach all attached disks so later removal of images is successful | ||
| // Do not delete a VCH in the target RP if it is not the target VCH | ||
| if child.Reference() != vmm.Reference() { |
We should not be allowing multiple VCHs in a single resource pool - the pool is part of the VCH. If it's possible to deploy in this manner then a check should be added in the create path and we should look at preventing the appliance from being moved out of its parent pool.
Not going to block this PR on it however.
This was to address if the user OOB moved a VCH into another VCH's RP. I will open an issue to disable moving a VCH
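The guard in the diff above amounts to this rule, sketched here with VM references as plain strings (the function name and string modeling are illustrative assumptions, not the vic-machine API):

```go
package main

import "fmt"

// shouldDeleteChild decides whether a child of the target pool may be
// removed: a child that is itself a VCH appliance is only deleted when it is
// the VCH actually being removed; containerVMs in the pool are fair game.
func shouldDeleteChild(childRef, targetVCHRef string, childIsVCH bool) bool {
	if childIsVCH && childRef != targetVCHRef {
		// another VCH moved into this pool out of band; leave it alone
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldDeleteChild("vm-10", "vm-10", true))  // the target VCH itself
	fmt.Println(shouldDeleteChild("vm-11", "vm-10", true))  // a sibling VCH: skip
	fmt.Println(shouldDeleteChild("vm-12", "vm-10", false)) // a containerVM: delete
}
```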
Force-pushed from ee88c38 to c37c105
We need to be more failsafe in this when it comes to the parent resource pool. I think we should remove the failover path for the actual parent pool - this code was added in #3116 but I'm unsure what specific scenario was being addressed with this path. We should also confirm that it's not possible to deploy the endpointVM directly into the root pool, and perhaps a sanity check to return nil if the ComputeResource[0] is the root pool.
Given this then for delete, if the parent pool is nil and --force is supplied, we do not attempt to delete children or delete the pool afterwards.
isContainerVM is okay, and used effectively for deleting the children.
lib/install/management/delete.go (Outdated)
```go
	return err
}
// Assume the cVMs and RP have already been deleted
log.Warnf("No container VMs found, but proceeding with delete of VCH due to --force")
```
I do not see a path by which both err and d.parentResourcepool are non-nil so this logic will simply skip the attempt to delete the VCH when we hit the d.parentResourcePool != nil condition.
The text about containerVMs is also misleading given we've not looked for containerVMs at this point.
```md
5. Delete the VM and RP to cleanup

### Expected Outcome:
1. All steps should succeed
```
We need another two tests:
- endpointVM is moved out of its pool, pool is left alone
- endpointVM is moved into the root of the cluster and the original pool is deleted.
Just to be clear, for 1) the endpointVM should be successfully deleted and the RP and its children should not be deleted
and for 2) the original pool should be deleted OOB after endpointVM is moved to the root pool then vic-machine delete should successfully delete the endpointVM?
Thanks
- correct
- correct - and other VMs in the root pool should be left alone (which is what the current test asserts but worth being explicit)
Added the integration tests, working through them now.
There should be a follow-up issue to look at disabling Move for the individual VMs.
Force-pushed from 820279d to 451b14b
hickeng left a comment
Thanks for the repeated iteration and addressing my paranoia!
```go
}
// Can't find the RP VCH was created in to delete cVMs, continue anyway
log.Warnf("No container VMs found, but proceeding with delete of VCH due to --force")
err = nil
```
getComputeResource never returns non-nil for both rp and err. If you set err = nil here then we will still skip over the delete of the containerVMs (not sure why the method is called DeleteVCHInstances).
I don't know what that will mean for deleting images if they are still attached to cVMs - I suspect it will fail with errors about locked vmdks if any of the cVMs are powered on. I don't see that that will cause any other "interesting" behaviours like deleting non-VCH related things, but please sanity check me if you haven't looked at this path.
```robot
# Delete with force
${moid}=  Get VM Moid  %{VCH-NAME}
${ret}  ${output}=  Run And Return Rc And Output  bin/vic-machine-linux delete --target %{TEST_URL} --user %{TEST_USERNAME} --password=%{TEST_PASSWORD} --compute-resource=%{TEST_RESOURCE} --id ${moid} --force
```
As a note, if you're specifying --id then I don't think you need --compute-resource. I frequently use --id to override which VCH I'm deleting when working with nimbus or vmc.
```robot
Delete VCH moved from its RP
    Run Keyword If  '%{HOST_TYPE}' == 'ESXi'  Pass Execution  Test skipped on ESX due to unable to move into RP
```
ESX does have resource pools and we should be able to move VMs between them. It's possible that govc is missing exposure of the appropriate command.
(I'm not sure how to view resource pools in the H5 client however).
Force-pushed from 451b14b to 71e19b4
Force-pushed from 68e9530 to 92cf5cd
Replacement PR in #6816 to get the build to run
Fixes #6603