pkg/destroy/aws: Untag shared resources #2467
Conversation
9fff12d to 8f517b7
/assign @abhinavdahiya All green (except for OpenStack, which I don't touch) :)
So I think we should be careful in untagging resources marked shared for a cluster. With `owned`, when a user creates resources, it seems natural for the uninstaller to delete them, since the user made that choice. With `shared`, when users tag the resources, I'm currently more inclined to say we untag only the resources we tagged in the first place, and the user owns removing the shared tag from resources they marked shared. WDYT?

How would we distinguish? Just by resource type? I'm fine untagging across the board, because cluster infra IDs are unique within an AWS account. And after a successful deletion that cluster will be gone. Why would a user want some
The only resources we tag are subnets, so only those.

IMO it is not about why, but about whether we can: do we have permission to remove the tag in the first place? What if users have some external process managing common resources for their cluster, and removing the tag would affect it? Because of this, I think starting with what we know we tagged is more appropriate, expanding to all resources if we get requests from users to do so.
I was planning on tagging the VPC too, but that's just informative. We don't need it, and I guess it exposes us to hitting the 50-tag limit on the VPC. Do we have a policy on whether we want to tag

my preference is this ^^

My impression from our stand-up discussion was that we are fine with this PR's blanket
8f517b7 to d472bee
Rebased onto master with 8f517b752 -> d472beed9 so I can build on this PR and have both untagging and the Terraform stuff from #2438.
IMHO, we shouldn't be tagging resources as shared... that should be implied when !owned.

Since we tag our resources first during create to reserve them, I would think we would remove those tags at the end, to mark that everything owned is removed and we now take the reservation away.

So looking at the API of the type, the metadata.json only comes into play when creating an Uninstaller from the metadata.json. And if we want to prevent users from specifying more than one, the current approach of silently ignoring/warning is not the best.

+1
d472bee to 108e283
Rebased onto master and shifted to do this with d472beed9 -> 108e28319. |
We need to tag subnets |
108e283 to 855a746
Rerolled to do this with 108e28319 -> 855a746b6. I still warn for |
855a746 to df443da
Remove any 'kubernetes.io/cluster/{id}: shared' tags, such as those
that we'll use for bring-your-own subnet workflows. The 20-resource-per-request
limit that leads to the loop is from [1]. We're unlikely to exceed 20 with
just the subnets, but it seemed reasonable to plan for the future to
avoid surprises if we grow this list going forward.
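Roughly, the batching could look like this minimal Go sketch (not the committed code; `untagBatches` and its inputs are illustrative names, and error handling is simplified):

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
)

// untagBatches chunks ARNs into groups of at most 20, the
// UntagResourcesInput.ResourceARNList cap from [1], and removes
// the given tag keys from each batch.
func untagBatches(client *resourcegroupstaggingapi.ResourceGroupsTaggingAPI, arns, tagKeys []string) error {
	const maxARNsPerRequest = 20
	for len(arns) > 0 {
		n := len(arns)
		if n > maxARNsPerRequest {
			n = maxARNsPerRequest
		}
		batch := arns[:n] // peel off up to 20 ARNs...
		arns = arns[n:]   // ...and keep the rest for the next pass
		if _, err := client.UntagResources(&resourcegroupstaggingapi.UntagResourcesInput{
			ResourceARNList: aws.StringSlice(batch),
			TagKeys:         aws.StringSlice(tagKeys),
		}); err != nil {
			return err
		}
	}
	return nil
}
```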
Including tagClients allows us to find and remove tags from Route 53
resources, which live in us-east-1 regardless of the rest of the
cluster's region. More on that under "The us-east-1 business..." in
e24c7dc (pkg/destroy/aws: Use the resource-groups service for
tag->ARN lookup, 2019-01-10, openshift#1039). I have not added untag support
for IAM resources (which are not supported by
resourcegroupstaggingapi, more on that in e24c7dc too), but we
could add untagging for them later if we want to support
explicitly-tagged shared IAM resources.
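For illustration, a sketch of that client setup, assuming an AWS SDK session is already available (`newTagClients` is a hypothetical name, not necessarily the PR's):

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
)

// newTagClients builds one tagging client for the cluster's region plus
// one for us-east-1, where Route 53 tags are served.
func newTagClients(sess *session.Session, clusterRegion string) []*resourcegroupstaggingapi.ResourceGroupsTaggingAPI {
	regions := []string{clusterRegion}
	if clusterRegion != "us-east-1" {
		regions = append(regions, "us-east-1")
	}
	clients := make([]*resourcegroupstaggingapi.ResourceGroupsTaggingAPI, 0, len(regions))
	for _, region := range regions {
		clients = append(clients, resourcegroupstaggingapi.New(sess, aws.NewConfig().WithRegion(region)))
	}
	return clients
}
```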
The append([]T(nil), a...) business is a slice copy [2], so we can
drain the stack in each of our loops while still having a full
tagClients slice to feed into the next loop.
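A tiny standalone example of the trick:

```go
package main

import "fmt"

func main() {
	a := []string{"x", "y", "z"}
	b := append([]string(nil), a...) // b: independent copy of a's elements
	for len(b) > 0 {                 // drain b like a stack...
		b = b[:len(b)-1]
	}
	fmt.Println(a, b) // ...while a still holds [x y z] for the next loop
}
```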
I'm calculating the shared tag by looking for existing
'kubernetes.io/cluster/{id}: owned' tags. In most cases there will be
only one, but the metadata structure theoretically allows folks to
pass in multiple such tags with different IDs, or similar tags with
values other than 'owned'. The logic I have here will remove 'shared'
forms of any 'owned' tags, and will warn for non-owned tags just in
case the user expects us to be removing 'shared' in those cases where
we will not. But I'm continuing on without returning an error,
because we want 'destroy cluster' to be pretty robust.
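A rough sketch of that selection logic, with hypothetical names (the real code's types and function names may differ):

```go
import (
	"strings"

	"github.com/sirupsen/logrus"
)

// sharedKeysToRemove picks the cluster tag keys whose 'shared' variants we
// should delete, warning (not erroring) on values other than 'owned'.
func sharedKeysToRemove(logger *logrus.Entry, tags map[string]string) []string {
	var keys []string
	for key, value := range tags {
		if !strings.HasPrefix(key, "kubernetes.io/cluster/") {
			continue
		}
		if value == "owned" {
			keys = append(keys, key) // remove the 'shared' form of this key
		} else {
			logger.Warnf("not removing 'shared' for non-owned tag %s: %s", key, value)
		}
	}
	return keys
}
```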
I'm also logging (with an info-level log) errors with retrieving tags
or untagging, but again continuing on without erroring out in
those cases. Hopefully those errors are ephemeral and a future
attempt will punch through and succeed. In future work, if these
errors are common enough, we might consider distinguishing between
error codes like the always-fatal ErrCodeInvalidParameterException and
the ephemeral ErrCodeThrottledException and aborting on always-fatal
errors.
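If we did add that triage, it might look roughly like this sketch (`fatalUntagError` is a hypothetical helper):

```go
import (
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
	"github.com/sirupsen/logrus"
)

// fatalUntagError reports whether an untag error is worth aborting on,
// logging ephemeral or unrecognized errors and letting the caller continue.
func fatalUntagError(logger *logrus.Entry, err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		switch aerr.Code() {
		case resourcegroupstaggingapi.ErrCodeInvalidParameterException:
			return true // always fatal; retrying will not help
		case resourcegroupstaggingapi.ErrCodeThrottledException:
			logger.Info(err) // ephemeral; a later attempt may punch through
			return false
		}
	}
	logger.Info(err) // unknown; assume ephemeral and keep going
	return false
}
```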
The 'removed' tracker guards against redundant untag attempts in the
face of AWS eventual consistency (i.e. subsequent GetResources calls
returning tags which had been removed by earlier UntagResources
calls). But there is a risk that we miss removing a 'shared' tag that
a parallel process re-injects while we're running:
1. We detect a shared tag on resource A.
2. We remove the tag from resource A.
3. Someone else adds it back.
4. We detect the restored tag on resource A, but interpret it as an
eventual-consistency thing and leave it alone.
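A minimal sketch of that guard (`untagOnce` is a hypothetical wrapper; the untag callback stands in for the real UntagResources call):

```go
// untagOnce skips ARNs already untagged during this run, so stale
// GetResources pages don't trigger redundant (and throttled) untag calls.
func untagOnce(removed map[string]struct{}, arns []string, untag func(arn string) error) error {
	for _, arn := range arns {
		if _, done := removed[arn]; done {
			continue // likely eventual consistency; we already untagged it
		}
		if err := untag(arn); err != nil {
			return err
		}
		removed[arn] = struct{}{}
	}
	return nil
}
```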
I inserted removeSharedTags at the end of the infrastructure removal
to mirror creation, where we tag shared resources before creating other
infrastructure. This effectively removes our reservations once we no
longer have any owned resources which could be consuming the shared
resources. This might cause us to leak shared tags in cases where
'destroy cluster' is killed after infrastructure removal but before
shared-tag removal completed. But there's not much you can do to
backstop that short of looking for orphaned 'shared' tags and removing
them via some custom reaper logic.
[1]: https://docs.aws.amazon.com/sdk-for-go/api/service/resourcegroupstaggingapi/#UntagResourcesInput
[2]: https://github.com/golang/go/wiki/SliceTricks#copy
pkg/destroy/aws/aws.go
Outdated
should we make this ERROR?
We're ignoring it, so I don't think so. I'm ok with warnings, although if we did that, this package has a few other infos we'd need to bump as well.
There is a build failure, otherwise this is LGTM @wking
df443da to 60fb1a8
Sigh. I fixed that, but then pushed my fixed branch to this repo instead of my fork. And now I can't remove it because of branch protection. And admins able to remove

```console
$ git branch -a | grep origin/ | grep -v '/pr/\|/release-\|master'
remotes/origin/jim-minter-patch-1
remotes/origin/shared-aws-untagging
remotes/origin/vsphere
remotes/origin/vsphere_hostname_resolution
```
@wking: The following tests failed, say `/retest` to rerun them all.

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Green enough.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Remove any `kubernetes.io/cluster/{id}: shared` tags, such as those that we'll use for bring-your-own subnet workflows (spun out of #2438). The 20-resource-per-request limit that leads to the loop is from here. We're unlikely to exceed 20 with just the VPC and subnets, but it seemed reasonable to plan for the future to avoid surprises if we grow this list going forward.

Including `tagClients` allows us to find and remove tags from Route 53 resources, which live in us-east-1 regardless of the rest of the cluster's region. More on that under "The us-east-1 business..." in e24c7dc (#1039). I have not added untag support for IAM resources (which are not supported by `resourcegroupstaggingapi`, more on that in e24c7dc too), but we could add untagging for them later if we want to support explicitly-tagged shared IAM resources.

I'm calculating the shared tag by looking for an existing `kubernetes.io/cluster/{id}: owned` tag. That will be fine for most cases, although the metadata structure theoretically allows folks to pass in multiple such tags with different IDs, or similar tags with values other than `owned`. I'm printing warn-level logs when that sort of thing happens, because I think it's likely that the `metadata.json` provider is broken in those cases. But I'm continuing on without returning an error, because we want `destroy cluster` to be pretty robust, and we can still safely remove `shared` keys for the one we have selected.

I'm also logging (with an info-level log) errors with retrieving tags or untagging, but again continuing on without erroring out in those cases. Hopefully those errors are ephemeral and a future attempt will punch through and succeed. In future work, if these errors are common enough, we might consider distinguishing between error codes like the always-fatal `ErrCodeInvalidParameterException` and the ephemeral `ErrCodeThrottledException` and aborting on always-fatal errors.

The `removed` tracker guards against redundant untag attempts in the face of AWS eventual consistency (i.e. subsequent `GetResources` calls returning tags which had been removed by earlier `UntagResources` calls). But there is a risk that we miss removing a `shared` tag that a parallel process re-injects while we're running.

I inserted `removeSharedTags` after `terminateEC2InstancesByTags` to avoid concerns about in-cluster instances adding those shared tags, although I think in-cluster components adding shared tags is unlikely. And `destroy cluster` isn't removing shared resources, so we're always vulnerable to parallel shared-tag injection. So I think guarding against redundant untag calls (and their AWS throttling cost) is worth the slightly increased exposure to shared-tag injection races.

I put `removeSharedTags` before the bulk of the infrastructure removal because we usually look to cluster resources to detect resource leaks and trigger repeat removal attempts. If we removed the shared tags last, a `destroy cluster` run that was killed after infrastructure removal but before shared-tag removal completed would leak the shared tags. There's nothing fundamental about keeping the non-tag infrastructure towards the end, though; I'd also be ok with having `removeSharedTags` at the very end of `ClusterUninstaller.Run`.