Conversation

@wking (Member) commented Oct 7, 2019

Remove any kubernetes.io/cluster/{id}: shared tags, such as those that we'll use for bring-your-own subnet workflows (spun out of #2438). The 20-tag limit that leads to the loop comes from the UntagResources API documentation. We're unlikely to exceed 20 with just the VPC and subnets, but it seemed reasonable to plan ahead to avoid surprises if this list grows.
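
As a rough illustration of the batching that limit forces, here is a minimal sketch using the AWS SDK's resourcegroupstaggingapi; the function name and plumbing are assumptions for illustration, not the PR's actual code:

    package sketch

    import (
        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
    )

    // untagInBatches removes a single tag key from the given ARNs,
    // at most 20 ARNs per UntagResources call, looping until done.
    func untagInBatches(client *resourcegroupstaggingapi.ResourceGroupsTaggingAPI, arns []string, tagKey string) error {
        for len(arns) > 0 {
            batch := arns
            if len(batch) > 20 {
                batch = batch[:20]
            }
            arns = arns[len(batch):]
            if _, err := client.UntagResources(&resourcegroupstaggingapi.UntagResourcesInput{
                ResourceARNList: aws.StringSlice(batch),
                TagKeys:         []*string{aws.String(tagKey)},
            }); err != nil {
                return err
            }
        }
        return nil
    }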

Including tagClients allows us to find and remove tags from Route 53 resources, which live in us-east-1 regardless of the rest of the cluster's region. More on that under "The us-east-1 business..." in e24c7dc (#1039). I have not added untag support for IAM resources (which are not supported by resourcegroupstaggingapi, more on that in e24c7dc too), but we could add untagging for them later if we want to support explicitly-tagged shared IAM resources.
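
Very roughly, that is why tagClients ends up being a slice of clients rather than a single client. A sketch, using the same SDK packages as above; sess and clusterRegion are assumed to exist and the variable names are illustrative:

    // One tagging client per region we need to sweep: the cluster's own
    // region plus us-east-1, where Route 53 resources carry their tags.
    tagClients := []*resourcegroupstaggingapi.ResourceGroupsTaggingAPI{
        resourcegroupstaggingapi.New(sess, aws.NewConfig().WithRegion(clusterRegion)),
    }
    if clusterRegion != "us-east-1" {
        tagClients = append(tagClients,
            resourcegroupstaggingapi.New(sess, aws.NewConfig().WithRegion("us-east-1")))
    }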

I'm calculating the shared tag by looking for an existing kubernetes.io/cluster/{id}: owned tag. That will be fine for most cases, although the metadata structure theoretically allows folks to pass in multiple such tags with different IDs, or similar tags with values other than owned. I'm printing warn-level logs when that sort of thing happens, because I think it's likely that the metadata.json provider is broken in those cases. But I'm continuing on without returning an error, because we want destroy cluster to be pretty robust, and we can still safely remove shared keys for the one we have selected.
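
A minimal sketch of that derivation, assuming a filter map of tag key/value pairs, the strings package, and a logrus-style logger (the variable names are illustrative):

    // Collect keys whose 'shared' variant we will remove; warn on
    // kubernetes.io/cluster/* entries whose value is not "owned" and
    // leave their 'shared' variants alone.
    sharedKeys := []string{}
    for key, value := range filter {
        if strings.HasPrefix(key, "kubernetes.io/cluster/") {
            if value == "owned" {
                sharedKeys = append(sharedKeys, key)
            } else {
                logger.Warnf("ignoring non-owned cluster key %q: %s", key, value)
            }
        }
    }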

I'm also logging (with an info-level log) errors from retrieving tags or untagging, but again continuing on without erroring out in those cases. Hopefully those errors are ephemeral and a future attempt will punch through and succeed. In future work, if these errors are common enough, we might consider distinguishing between error codes like the always-fatal ErrCodeInvalidParameterException and the ephemeral ErrCodeThrottledException and aborting on always-fatal errors.
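
If we later wanted that distinction, it might look roughly like the following. This is a sketch only, assuming client, input, and logger exist in scope and awserr comes from github.com/aws/aws-sdk-go/aws/awserr; the PR itself just logs and keeps going:

    if _, err := client.UntagResources(input); err != nil {
        if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == resourcegroupstaggingapi.ErrCodeInvalidParameterException {
            return err // always fatal; retrying will not help
        }
        // Anything else (e.g. ErrCodeThrottledException) is hopefully
        // ephemeral, so log it and let a later pass try again.
        logger.Info(err)
    }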

The removed tracker guards against redundant untag attempts in the face of AWS eventual consistency (i.e. subsequent GetResources calls returning tags which had been removed by earlier UntagResources calls). But there is a risk that we miss removing a shared tag that a parallel process re-injects while we're running:

  1. We detect a shared tag on resource A.
  2. We remove the tag from resource A.
  3. Someone else adds it back.
  4. We detect the restored tag on resource A, but interpret it as an eventual-consistency thing and leave it alone.
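
A sketch of what that tracker amounts to; untagSharedKey, arnsWithSharedTag, client, and sharedKey are hypothetical names used only for illustration:

    // removed remembers ARNs we have already untagged, so stale
    // GetResources results don't trigger redundant UntagResources calls.
    removed := map[string]bool{}
    for _, arn := range arnsWithSharedTag { // from a GetResources sweep
        if removed[arn] {
            continue // probably eventual consistency, not a real re-tag
        }
        if err := untagSharedKey(client, arn, sharedKey); err == nil {
            removed[arn] = true
        }
    }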

I inserted removeSharedTags after terminateEC2InstancesByTags to avoid concerns about in-cluster instances adding those shared tags, although I think in-cluster components adding shared tags is unlikely. And destroy cluster isn't removing shared resources, so we're always vulnerable to parallel shared-tag injection. So I think guarding against redundant untag calls (and their AWS throttling cost) is worth the slightly increased exposure to shared-tag injection races.

I put removeSharedTags before the bulk of the infrastructure removal because we usually look to cluster resources to detect resource leaks and trigger repeat removal attempts. If we removed the shared tags last, a destroy cluster run that was killed after infrastructure removal but before shared-tag removal completed would leak the shared tags. There's nothing fundamental about keeping the non-tag infrastructure towards the end, though; I'd also be ok with having removeSharedTags at the very end of ClusterUninstaller.Run.

@openshift-ci-robot added the approved and size/M labels on Oct 7, 2019
@wking force-pushed the shared-aws-untagging branch from 9fff12d to 8f517b7 on October 7, 2019 20:01
@wking (Member, Author) commented Oct 7, 2019

/assign @abhinavdahiya

All green (except for OpenStack, which I don't touch) :)

@abhinavdahiya (Contributor) commented:

So I think we should be careful about untagging resources marked shared for a cluster.

With "owned", when the user creates resources, it seems natural for the uninstaller to delete those resources, since the user made that choice.

With "shared", when users tag the resources, I am currently more inclined to say we should only untag resources we tagged in the first place, and the user owns removing the shared tag from resources they marked shared themselves.

WDYT?
Would starting with a minimum set for shared and expanding from there be a safer path?

@wking (Member, Author) commented Oct 8, 2019

With "shared", when users tags the resources, I am currently more inclined to saying we untag resources we tagged in the first place and user owns removing the shared tag from resources they marked shared.

How would we distinguish? Just by resource type? I'm fine untagging across the board because cluster infra IDs are unique within an AWS account. And after a successful deletion that cluster will be gone. Why would a user want some shared tags to persist?

@abhinavdahiya (Contributor) commented:

With "shared", when users tags the resources, I am currently more inclined to saying we untag resources we tagged in the first place and user owns removing the shared tag from resources they marked shared.

How would we distinguish? Just by resource type?

The only resources we tag are subnets, so only them.

I'm fine untagging across the board because cluster infra IDs are unique within an AWS account. And after a successful deletion that cluster will be gone. Why would a user want some shared tags to persist?

IMO it is not about why, but about whether we can. Do we have permission to remove the tag in the first place? What if users have some external process managing common resources for their cluster, and removing the tag would affect them?

Because of this, I think starting with what we know we tagged is more appropriate, and expanding to all resources if we get requests from users to do so.

@wking (Member, Author) commented Oct 8, 2019

The only resources we tag are subnets, so only them.

I was planning on tagging the VPC too, but that's just informative. We don't need it, and I guess it exposes us to hitting the 50-tag limit on the VPC. Do we have a policy on whether we want to tag shared on all resources we consume (e.g. also the public Route 53 zone), some resources we consume (e.g. VPCs and subnets), or none of the resources we consume unless we actually need them tagged for some cloud-provider-ish purpose (just subnets)? I think there's some contention between the 50-tag limit (pushing for fewer tags) and consumer visibility (pushing for more tags).

@abhinavdahiya (Contributor) commented:

none of the resources we consume unless we actually need them tagged for some cloud-provider-ish purpose (just subnets)?

my preference is this ^^

@wking (Member, Author) commented Oct 8, 2019

My impression from our stand-up discussion was that we are fine with this PR's blanket shared removal (from all resource types) and that we want to only tag shared resources when we need them to get a functional cluster (so currently just subnets). That means this PR should be good to go, and I'll reroll the tagging logic in the next PR in this series to only tag subnets. Am I getting that right?

@wking force-pushed the shared-aws-untagging branch from 8f517b7 to d472bee on October 8, 2019 23:47
@wking (Member, Author) commented Oct 8, 2019

Rebased onto master with 8f517b752 -> d472beed9 so I can build on this PR and have both untagging and the Terraform stuff from #2438.

@jstuever (Contributor) commented:

IMHO, we shouldn't be tagging resources as shared... that should be implied when !owned.

@abhinavdahiya (Contributor) commented:

I put removeSharedTags before the bulk of the infrastructure removal because we usually look to cluster resources to detect resource leaks and trigger repeat removal attempts. If we removed the shared tags last, a destroy cluster run that was killed after infrastructure removal but before shared-tag removal completed would leak the shared tags. There's nothing fundamental about keeping the non-tag infrastructure towards the end, though; I'd also be ok with having removeSharedTags at the very end of ClusterUninstaller.Run.

Since we tag our resources first during create to reserve them, I would think we would remove those tags at the end, to mark that everything owned has been removed and we are now taking the reservation away.

@abhinavdahiya (Contributor) commented:

I'm calculating the shared tag by looking for an existing kubernetes.io/cluster/{id}: owned tag. That will be fine for most cases, although the metadata structure theoretically allows folks to pass in multiple such tags with different IDs, or similar tags with values other than owned. I'm printing warn-level logs when that sort of thing happens, because I think it's likely that the metadata.json provider is broken in those cases. But I'm continuing on without returning an error, because we want destroy cluster to be pretty robust, and we can still safely remove shared keys for the one we have selected.

So looking at the API of the pkg/destroy/aws.ClusterUninstaller type, it accepts OR'd tag filters, so I would expect this package to remove all corresponding shared tags for the owned tags (roughly as sketched below).

The metadata.json only comes into play when creating an Uninstaller from the metadata.json, and if we want to prevent users from specifying more than one owned tag, we should perform that check and fail at that step IMO.

The current approach of silently ignoring/warning is not the best.
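
For context, the uninstaller's filters are roughly a list of tag maps that are OR'd together, something like the sketch below; the ClusterUninstaller and Filter names follow pkg/destroy/aws, but the exact fields and the example tag values are assumptions for illustration:

    // Two OR'd filters: resources matching either map are destroyed,
    // and under this PR their 'shared' counterparts would be untagged.
    uninstaller := &ClusterUninstaller{
        Filters: []Filter{
            {"kubernetes.io/cluster/cluster-a-x7k2p": "owned"},
            {"kubernetes.io/cluster/cluster-b-j9q4m": "owned"},
        },
    }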

@abhinavdahiya (Contributor) commented:

Including tagClients allows us to find and remove tags from Route 53 resources, which live in us-east-1 regardless of the rest of the cluster's region. More on that under "The us-east-1 business..." in e24c7dc (#1039). I have not added untag support for IAM resources (which are not supported by resourcegroupstaggingapi, more on that in e24c7dc too), but we could add untagging for them later if we want to support explicitly-tagged shared IAM resources.

+1

@wking force-pushed the shared-aws-untagging branch from d472bee to 108e283 on October 11, 2019 22:27
@wking (Member, Author) commented Oct 11, 2019

Since we tag our resources first during create to reserve them, I would think we would remove those tags at the end...

Rebased onto master and shifted to do this with d472beed9 -> 108e28319.

@wking (Member, Author) commented Oct 11, 2019

... we shouldn't be tagging resources as shared...

We need to tag subnets shared, because that's how the kubelet's cloud provider decides where to place load balancers for LoadBalancer Services.
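
Concretely, that is the standard AWS cloud-provider cluster tag applied to each subnet the cluster uses but does not own; in the sketch below, infraID is a stand-in for the cluster's infrastructure ID:

    // Tag a bring-your-own subnet so the AWS cloud provider knows this
    // cluster may place load balancers in it without owning it.
    subnetTags := map[string]string{
        "kubernetes.io/cluster/" + infraID: "shared",
    }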

@wking force-pushed the shared-aws-untagging branch from 108e283 to 855a746 on October 11, 2019 22:43
@wking (Member, Author) commented Oct 11, 2019

... i would assume this package to remove all corresponding shared tags for owned tags...

Rerolled to do this with 108e28319 -> 855a746b6. I still warn for kubernetes.io/cluster/* tags in the filters if their value is not owned, and do not remove shared tags in those cases.

@wking force-pushed the shared-aws-untagging branch from 855a746 to df443da on October 11, 2019 23:41
Remove any 'kubernetes.io/cluster/{id}: shared' tags, such as those
that we'll use for bring-your-own subnet workflows.  The 20-tag limit
that leads to the loop is from [1].  We're unlikely to exceed 20 with
just the subnets, but it seemed reasonable to plan ahead to avoid
surprises if this list grows.

Including tagClients allows us to find and remove tags from Route 53
resources, which live in us-east-1 regardless of the rest of the
cluster's region.  More on that under "The us-east-1 business..." in
e24c7dc (pkg/destroy/aws: Use the resource-groups service for
tag->ARN lookup, 2019-01-10, openshift#1039).  I have not added untag support
for IAM resources (which are not supported by
resourcegroupstaggingapi, more on that in e24c7dc too), but we
could add untagging for them later if we want to support
explicitly-tagged shared IAM resources.

The append([]T(nil), a...) business is a slice copy [2], so we can
drain the stack in each of our loops while still having a full
tagClients slice to feed into the next loop.

I'm calculating the shared tag by looking for existing
'kubernetes.io/cluster/{id}: owned' tags.  In most cases there will be
only one, but the metadata structure theoretically allows folks to
pass in multiple such tags with different IDs, or similar tags with
values other than 'owned'.  The logic I have here will remove 'shared'
forms of any 'owned' tags, and will warn for non-owned tags just in
case the user expects us to be removing 'shared' in those cases where
we will not.  But I'm continuing on without returning an error,
because we want 'destroy cluster' to be pretty robust.

I'm also logging (with an info-level log) errors from retrieving tags
or untagging, but again continuing on without erroring out in
those cases.  Hopefully those errors are ephemeral and a future
attempt will punch through and succeed.  In future work, if these
errors are common enough, we might consider distinguishing between
error codes like the always-fatal ErrCodeInvalidParameterException and
the ephemeral ErrCodeThrottledException and aborting on always-fatal
errors.

The 'removed' tracker guards against redundant untag attempts in the
face of AWS eventual consistency (i.e. subsequent GetResources calls
returning tags which had been removed by earlier UntagResources
calls).  But there is a risk that we miss removing a 'shared' tag that
a parallel process re-injects while we're running:

1. We detect a shared tag on resource A.
2. We remove the tag from resource A.
3. Someone else adds it back.
4. We detect the restored tag on resource A, but interpret it as an
   eventual-consistency thing and leave it alone.

I inserted removeSharedTags at the end of the infrastructure removal
to reflect creation where we tag shared before creating other
infrastructure.  This effectively removes our reservations once we no
longer have any owned resources which could be consuming the shared
resources.  This might cause us to leak shared tags in cases where
'destroy cluster' is killed after infrastructure removal but before
shared-tag removal completed.  But there's not much you can do to
backstop that short of looking for orphaned 'shared' tags and removing
them via some custom reaper logic.

[1]: https://docs.aws.amazon.com/sdk-for-go/api/service/resourcegroupstaggingapi/#UntagResourcesInput
[2]: https://github.com/golang/go/wiki/SliceTricks#copy
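
As a standalone illustration of the append([]T(nil), a...) copy idiom mentioned in the commit message, here is a toy sketch (not the PR's code) showing why the copy keeps tagClients intact for the next loop:

    // Copy the client slice so popping from the working stack does not
    // disturb the original tagClients list needed by later loops.
    stack := append([]*resourcegroupstaggingapi.ResourceGroupsTaggingAPI(nil), tagClients...)
    for len(stack) > 0 {
        client := stack[len(stack)-1]
        stack = stack[:len(stack)-1]
        _ = client // ... GetResources / UntagResources with this client ...
    }
    // tagClients is unchanged and can seed the next draining loop.
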
Review comment (Contributor):

Should we make this ERROR?

Reply (Member, Author):

Should we make this ERROR?

We're ignoring it, so I don't think so. I'm ok with warnings though, although if we did that there are a few other info-level logs in this package we'd need to bump as well.

@abhinavdahiya
Copy link
Contributor

There is a build failure; otherwise this is LGTM @wking

@wking force-pushed the shared-aws-untagging branch from df443da to 60fb1a8 on October 12, 2019 00:02
@wking (Member, Author) commented Oct 12, 2019

There is a build failure...

Sigh. I fixed that, but then pushed my fixed branch to this repo instead of my fork. And now I can't remove it because of branch protection. Any admins able to remove shared-aws-untagging from this repo? Looks like there are a few others too:

$ git branch -a | grep origin/ | grep -v '/pr/\|/release-\|master'
  remotes/origin/jim-minter-patch-1
  remotes/origin/shared-aws-untagging
  remotes/origin/vsphere
  remotes/origin/vsphere_hostname_resolution

@openshift-ci-robot (Contributor) commented:

@wking: The following tests failed, say /retest to rerun them all:

Test name                      Commit   Rerun command
ci/prow/e2e-openstack          60fb1a8  /test e2e-openstack
ci/prow/e2e-aws-scaleup-rhel7  60fb1a8  /test e2e-aws-scaleup-rhel7

@wking (Member, Author) commented Oct 12, 2019

Green enough.

@abhinavdahiya (Contributor) commented:

/lgtm

@openshift-ci-robot added the lgtm label on Oct 12, 2019
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 1405312 into openshift:master on Oct 12, 2019
@wking deleted the shared-aws-untagging branch on October 12, 2019 03:54