Bug 1740933: pkg/destroy/aws: Terminate instances before other resources #2169

wking · 2019-08-06T22:18:10Z

Fix races like:

The installer removes a bucket
The still-running registry operator tries to self-heal and creates a new bucket, but before it can push tags...
The installer terminates the instance where the registry operator was running.
The installer leaks the new, untagged bucket.

This has been happening in CI in jobs like this one, where CloudFormation logs show:

12:56:32, registry operator creates the bucket
14:31:10, installer deletes the bucket
14:31:11, registry operator creates the bucket again
14:31:11, installer starts requesting instance termination

And it could theoretically happen to any resource (not just buckets) which we discover by tag, but which is not tagged atomically on creation.

abhinavdahiya · 2019-08-06T22:33:52Z

Fix races like:

1. The installer removes a bucket

2. The still-running registry operator tries to self-heal and creates a new bucket, but before it can push tags...

3. The installer terminates the instance where the registry operator was running.

4. The installer leaks the new, untagged bucket.

is the resourcetaggingapi lagging wrt to newly creately objects? because otherwise the sync loop doesn't stop unless it see no resources...
^^ in which case there shouldn't be a leak...

or is th registry operator creating new buckets without tags when it recreates.

abhinavdahiya · 2019-08-06T22:35:40Z

I believe it might be more to to with this check skipping resources that were previously deleted and recreated but had same ARN..
https://github.com/openshift/installer/blob/449039508ab174d9998ea63e8d3b62f9d4614894/pkg/destroy/aws/aws.go#L173

abhinavdahiya · 2019-08-08T13:52:18Z

pkg/destroy/aws/aws.go

resourcegrouptaggingspi loop already supports resource type filters.. I think it would be better if we use less number of loops that do or filtering themselves.. let changes of error.

Also this doesn't go into deleted so the resourcegrouptaggingspi loop might try to send deletes good these again..

resourcegrouptaggingspi loop already supports resource type filters..

Sharing code is great, but running DescribeInstancesPages in deleteEC2InstancesByTags saves us from calling DescribeInstances in deleteEC2Instance for each returned instance. Did you have other ideas for restructuring deleteEC2InstancesByTags to leverage existing code?

Also this doesn't go into deleted so the resourcegrouptaggingspi loop might try to send deletes good these again.

I've added 53a8954e36 to this PR to address that. But as I point out in that commit message, I think all that does is save us a DescribeInstances call (still useful). And if we want it for deleteEC2InstancesByTags, we might also want it for deleteEC2NATGatewaysByVPC and similar. It's probably a per-pathway call about whether saving a delete-gating describe call is worth the cost of plumbing through the deleted-ARNs data.

wking · 2019-08-13T21:41:57Z

/retitle Bug 1740933: pkg/destroy/aws: Terminate instances before other resources

openshift-ci-robot · 2019-08-13T21:42:03Z

@wking: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Bug 1740933: pkg/destroy/aws: Terminate instances before other resources

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2019-08-13T21:56:41Z

I believe it might be more to to with this check skipping resources that were previously deleted and recreated but had same ARN

That's a separate issue because in this case the replacement bucket was never tagged as cluster-owned. I've filed rhbz#1740935 to track it, and tried to distinguish more clearly between that and rhbz#1740933 which I'm targeting in this PR.

sdodson · 2019-08-29T14:55:38Z

/retest

wking · 2019-08-30T21:04:40Z

Rebased onto master and added a tag: prefix to the filter with 540872e04 -> a128b4350. That should resolve the hang in teardown.

wking · 2019-08-30T22:36:53Z

e2e-aws passed. e2e-aws-upgrade is outstanding, but that doesn't exercise the code in this PR. So I think we're all good to go here.

pkg/destroy/aws/aws.go

abhinavdahiya · 2019-09-03T22:34:50Z

pkg/destroy/aws/aws.go

Hmm, don't we INFO only deleted stuff?

Hmm, don't we INFO only deleted stuff?

We used to info-log Deleted when a termination request was accepted. This log line preserves that approach with a better string, because you can't actually delete instances, and we aren't waiting around to see AWS eventually reap the terminated instances.

but when do we log Terminated? You capture the terminated when ByTags, but if we get an instance by deleteARN you'll only see Terminating.. that's not ideal.

This log line preserves that approach with a better string, because you can't actually delete instances

but to be fair, the for all intents and purposes the instance will be delete at some time.. after the termination request was sent, so the Deleted wasn't wrong.

This log line preserves that approach with a better string, because you can't actually delete instances, and we aren't waiting around to see AWS eventually reap the terminated instances.

also in that case the deleteEC2InstanceByInstance isn't appropriate too..

also in that case the deleteEC2InstanceByInstance isn't appropriate too..

s/deleteEC2Instance/terminateEC2Instance across the file with 6af061afd -> 54cfed5.

but when do we log Terminated?

Never, with the deleteARN chain. I'm fine with that, because tracking these outside of terminateEC2InstancesByTags just so we can report the terminated state doesn't seem like it's worth the trouble to me. Do you think I need to work that in?

but to be fair, the for all intents and purposes the instance will be delete at some time...

Which is why I'm fine leaving things at Terminating when we don't have a reason to wait out the termination (like we do in terminateEC2InstancesByTags but which we don't with the deleteARN tree).

... after the termination request was sent, so the Deleted wasn't wrong.

I still think Terminating is better than Deleted for "my TerminateInstances call just went through". But yeah, it's splitting a bit of a hair. If you feel like I'm being overly picky, I can shift things back to Deleted.

Fix races like [1]: 1. The installer removes a bucket 2. The still-running registry operator tries to self-heal and creates a new bucket, but before it can push tags... 3. The installer terminates the instance where the registry operator was running. 4. The installer leaks the new, untagged bucket. This has been happening in CI in jobs like [2,3], where CloudFormation logs show: 1. 12:56:32, registry operator creates the bucket 2. 14:31:10, installer deletes the bucket 3. 14:31:11, registry operator creates the bucket again 4. 14:31:11, installer starts requesting instance termination And it could theoretically happen to any resource (not just buckets) which we discover by tag, but which is not tagged atomically on creation. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1740933 [2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/232/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade/454 [3]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/232/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade/454/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Deleted arn=.*$instance\|s3$' time="2019-08-06T14:31:10Z" level=info msg=Deleted arn="arn:aws:s3:::image-registry-us-east-1-ciopjl76tqd5ff81b2fw4x-9d77aba0b84911" time="2019-08-06T14:31:11Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-042a8a77afcf9ab69" id=i-042a8a77afcf9ab69 time="2019-08-06T14:31:11Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-08e03cff5a3b35802" id=i-08e03cff5a3b35802 time="2019-08-06T14:31:13Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0abc61b48d7bdb810" id=i-0abc61b48d7bdb810 time="2019-08-06T14:31:17Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0a1a89a91ab4dbcb0" id=i-0a1a89a91ab4dbcb0 time="2019-08-06T14:31:23Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0c4c6241ea5ecccec" id=i-0c4c6241ea5ecccec time="2019-08-06T14:31:23Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-003abf1d84ecbea9f" id=i-003abf1d84ecbea9f

We might want this same sort of thing in deleteEC2NATGatewaysByVPC and our other functions where the caller doesn't know what resources the function might be deleting. But at the moment, Abhinav is only asking for a preemptive guard for doubled instance investigation [1]. The ARN syntax for instances is [2]: arn:aws:ec2:{region}:{account-id}:instance/{instance-id} and while we have to stretch into the ec2Client and reservation objects to find the region and account ID, constructing the ARN ourselves isn't that bad. Without this guard, any instances that terminateEC2InstancesByTags terminates but which still show up via the generic tag search will flow down through Run's deleteARN, through terminateEC2Instance, and end up in terminateEC2InstanceByInstance where terminated instances are silently skipped. So it doesn't save us all that much to return terminated instance ARNs, just the DescribeInstances call in terminateEC2Instance. [1]: openshift#2169 (comment) [2]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-ec2

abhinavdahiya · 2019-09-04T20:32:53Z

/lgtm

openshift-ci-robot · 2019-09-04T20:33:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2019-09-04T21:25:07Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-09-04T23:35:13Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-09-05T01:06:09Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2019-09-05T01:08:24Z

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1740933 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1740933: pkg/destroy/aws: Terminate instances before other resources

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2019-09-05T02:14:01Z

@wking: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
ci/prow/e2e-libvirt	`54cfed5`	link	`/test e2e-libvirt`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Partitions are more complicated than I thought when I wrote the hard-coded arn:aws:ec2:... prefix in 54cfed5 (pkg/destroy/aws/aws: Return ARNs from terminateEC2InstancesByTags, 2019-08-13, openshift#2169). While most of our regions use 'aws', China uses aws-cn [1], GovCloud uses aws-us-gov [2], and there may be more. Pull this ARN componet from the AWS SDK instead to track that sort of thing. The Gopkg.lock bump was generated with: $ dep ensure using: $ dep version dep: version : v0.5.1 build date : 2019-03-20 git hash : faa61893 go version : go1.10.3 go compiler : gc platform : linux/amd64 features : ImportDuringSolve=false [1]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arns-syntax [2]: https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/#pkg-overview

I'd added this to save on AWS API-call volume in fc70eb2 (pkg/destroy/aws: Skip direct tag deletion for network interfaces, 2019-05-02, openshift#1704). But now that we support user-provided VPCs with installer-provided network interfaces, it's no longer feasible. Happily, the tagged network interfaces are just for installer-created control-plane machines, see 4fcdc1c (data/aws: explicitly create the network interface for masters to speed up DNS, 2019-03-04, openshift#1361). And since cf69c1e (pkg/destroy/aws: Terminate instances before other resources, 2019-08-06, openshift#2169) we are terminating instances first and waiting for their termination to complete before proceeding to other resources. So we should no longer have the long runs of failed network-interface deletion attempts while we wait for the instances to terminate, and this commit should make our removal more robust without increasing our API-call volume.

We might want this same sort of thing in deleteEC2NATGatewaysByVPC and our other functions where the caller doesn't know what resources the function might be deleting. But at the moment, Abhinav is only asking for a preemptive guard for doubled instance investigation [1]. The ARN syntax for instances is [2]: arn:aws:ec2:{region}:{account-id}:instance/{instance-id} and while we have to stretch into the ec2Client and reservation objects to find the region and account ID, constructing the ARN ourselves isn't that bad. Without this guard, any instances that terminateEC2InstancesByTags terminates but which still show up via the generic tag search will flow down through Run's deleteARN, through terminateEC2Instance, and end up in terminateEC2InstanceByInstance where terminated instances are silently skipped. So it doesn't save us all that much to return terminated instance ARNs, just the DescribeInstances call in terminateEC2Instance. [1]: openshift#2169 (comment) [2]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-ec2

Partitions are more complicated than I thought when I wrote the hard-coded arn:aws:ec2:... prefix in 54cfed5 (pkg/destroy/aws/aws: Return ARNs from terminateEC2InstancesByTags, 2019-08-13, openshift#2169). While most of our regions use 'aws', China uses aws-cn [1], GovCloud uses aws-us-gov [2], and there may be more. Pull this ARN componet from the AWS SDK instead to track that sort of thing. The Gopkg.lock bump was generated with: $ dep ensure using: $ dep version dep: version : v0.5.1 build date : 2019-03-20 git hash : faa61893 go version : go1.10.3 go compiler : gc platform : linux/amd64 features : ImportDuringSolve=false [1]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arns-syntax [2]: https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/#pkg-overview

I'd added this to save on AWS API-call volume in fc70eb2 (pkg/destroy/aws: Skip direct tag deletion for network interfaces, 2019-05-02, openshift#1704). But now that we support user-provided VPCs with installer-provided network interfaces, it's no longer feasible. Happily, the tagged network interfaces are just for installer-created control-plane machines, see 4fcdc1c (data/aws: explicitly create the network interface for masters to speed up DNS, 2019-03-04, openshift#1361). And since cf69c1e (pkg/destroy/aws: Terminate instances before other resources, 2019-08-06, openshift#2169) we are terminating instances first and waiting for their termination to complete before proceeding to other resources. So we should no longer have the long runs of failed network-interface deletion attempts while we wait for the instances to terminate, and this commit should make our removal more robust without increasing our API-call volume.

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2019

openshift-ci-robot requested review from jhixson74 and patrickdillon August 6, 2019 22:18

openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 6, 2019

wking mentioned this pull request Aug 6, 2019

Add Name tag to S3 bucket openshift/cluster-image-registry-operator#352

Merged

abhinavdahiya reviewed Aug 8, 2019

View reviewed changes

wking force-pushed the delete-aws-instances-first branch from 53a8954 to 44d8d02 Compare August 13, 2019 21:35

openshift-ci-robot changed the title ~~pkg/destroy/aws: Terminate instances before other resources~~ Bug 1740933: pkg/destroy/aws: Terminate instances before other resources Aug 13, 2019

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 13, 2019

wking force-pushed the delete-aws-instances-first branch from 44d8d02 to 540872e Compare August 13, 2019 21:42

wking force-pushed the delete-aws-instances-first branch from 540872e to a128b43 Compare August 30, 2019 21:04

wking mentioned this pull request Aug 30, 2019

pkg/destroy/aws: Skip direct tag deletion for network interfaces #1704

Merged

abhinavdahiya reviewed Sep 3, 2019

View reviewed changes

pkg/destroy/aws/aws.go Outdated Show resolved Hide resolved

abhinavdahiya reviewed Sep 3, 2019

View reviewed changes

wking force-pushed the delete-aws-instances-first branch from a128b43 to 6af061a Compare September 4, 2019 04:10

wking force-pushed the delete-aws-instances-first branch from 6af061a to aea4d3d Compare September 4, 2019 04:12

wking added 2 commits September 3, 2019 21:19

wking force-pushed the delete-aws-instances-first branch from aea4d3d to 54cfed5 Compare September 4, 2019 04:20

openshift-ci-robot assigned abhinavdahiya Sep 4, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 4, 2019

openshift-merge-robot merged commit 03ac8c9 into openshift:master Sep 5, 2019

wking deleted the delete-aws-instances-first branch September 5, 2019 02:31

wking mentioned this pull request Oct 3, 2019

pkg/destroy/aws: Dynamic partitions, fixing hard-coded 'aws' #2453

Merged

wking mentioned this pull request Oct 6, 2019

pkg/destroy/aws: Stop skipping direct network-interface deletion #2460

Merged

Bug 1740933: pkg/destroy/aws: Terminate instances before other resources #2169

Bug 1740933: pkg/destroy/aws: Terminate instances before other resources #2169

Uh oh!

Conversation

wking commented Aug 6, 2019

Uh oh!

abhinavdahiya commented Aug 6, 2019

Uh oh!

abhinavdahiya commented Aug 6, 2019

Uh oh!

abhinavdahiya Aug 8, 2019

Choose a reason for hiding this comment

Uh oh!

wking Aug 13, 2019

Choose a reason for hiding this comment

Uh oh!

wking commented Aug 13, 2019

Uh oh!

openshift-ci-robot commented Aug 13, 2019

Uh oh!

wking commented Aug 13, 2019

Uh oh!

sdodson commented Aug 29, 2019

Uh oh!

wking commented Aug 30, 2019

Uh oh!

wking commented Aug 30, 2019

Uh oh!

Uh oh!

abhinavdahiya Sep 3, 2019

Choose a reason for hiding this comment

Uh oh!

wking Sep 3, 2019

Choose a reason for hiding this comment

Uh oh!

abhinavdahiya Sep 3, 2019

Choose a reason for hiding this comment

Uh oh!

abhinavdahiya Sep 3, 2019

Choose a reason for hiding this comment

Uh oh!

wking Sep 4, 2019

Choose a reason for hiding this comment

Uh oh!

abhinavdahiya commented Sep 4, 2019

Uh oh!

openshift-ci-robot commented Sep 4, 2019

Uh oh!

openshift-bot commented Sep 4, 2019

Uh oh!

openshift-bot commented Sep 4, 2019

Uh oh!

openshift-bot commented Sep 5, 2019

Uh oh!

openshift-ci-robot commented Sep 5, 2019

Uh oh!

openshift-ci-robot commented Sep 5, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants