-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Bug 1740933: pkg/destroy/aws: Terminate instances before other resources #2169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1740933: pkg/destroy/aws: Terminate instances before other resources #2169
Conversation
is the resourcetaggingapi lagging wrt to newly creately objects? because otherwise the sync loop doesn't stop unless it see no resources... or is th registry operator creating new buckets without tags when it recreates. |
|
I believe it might be more to to with this check skipping resources that were previously deleted and recreated but had same ARN.. |
pkg/destroy/aws/aws.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resourcegrouptaggingspi loop already supports resource type filters.. I think it would be better if we use less number of loops that do or filtering themselves.. let changes of error.
Also this doesn't go into deleted so the resourcegrouptaggingspi loop might try to send deletes good these again..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resourcegrouptaggingspi loop already supports resource type filters..
Sharing code is great, but running DescribeInstancesPages in deleteEC2InstancesByTags saves us from calling DescribeInstances in deleteEC2Instance for each returned instance. Did you have other ideas for restructuring deleteEC2InstancesByTags to leverage existing code?
Also this doesn't go into deleted so the resourcegrouptaggingspi loop might try to send deletes good these again.
I've added 53a8954e36 to this PR to address that. But as I point out in that commit message, I think all that does is save us a DescribeInstances call (still useful). And if we want it for deleteEC2InstancesByTags, we might also want it for deleteEC2NATGatewaysByVPC and similar. It's probably a per-pathway call about whether saving a delete-gating describe call is worth the cost of plumbing through the deleted-ARNs data.
53a8954 to
44d8d02
Compare
|
/retitle Bug 1740933: pkg/destroy/aws: Terminate instances before other resources |
|
@wking: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
44d8d02 to
540872e
Compare
That's a separate issue because in this case the replacement bucket was never tagged as cluster-owned. I've filed rhbz#1740935 to track it, and tried to distinguish more clearly between that and rhbz#1740933 which I'm targeting in this PR. |
|
/retest |
540872e to
a128b43
Compare
|
Rebased onto master and added a |
|
e2e-aws passed. e2e-aws-upgrade is outstanding, but that doesn't exercise the code in this PR. So I think we're all good to go here. |
pkg/destroy/aws/aws.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, don't we INFO only deleted stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, don't we INFO only
deletedstuff?
We used to info-log Deleted when a termination request was accepted. This log line preserves that approach with a better string, because you can't actually delete instances, and we aren't waiting around to see AWS eventually reap the terminated instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but when do we log Terminated? You capture the terminated when ByTags, but if we get an instance by deleteARN you'll only see Terminating.. that's not ideal.
This log line preserves that approach with a better string, because you can't actually delete instances
but to be fair, the for all intents and purposes the instance will be delete at some time.. after the termination request was sent, so the Deleted wasn't wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log line preserves that approach with a better string, because you can't actually delete instances, and we aren't waiting around to see AWS eventually reap the terminated instances.
also in that case the deleteEC2InstanceByInstance isn't appropriate too..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also in that case the
deleteEC2InstanceByInstanceisn't appropriate too..
s/deleteEC2Instance/terminateEC2Instance across the file with 6af061afd -> 54cfed5.
but when do we log
Terminated?
Never, with the deleteARN chain. I'm fine with that, because tracking these outside of terminateEC2InstancesByTags just so we can report the terminated state doesn't seem like it's worth the trouble to me. Do you think I need to work that in?
but to be fair, the for all intents and purposes the instance will be delete at some time...
Which is why I'm fine leaving things at Terminating when we don't have a reason to wait out the termination (like we do in terminateEC2InstancesByTags but which we don't with the deleteARN tree).
... after the termination request was sent, so the
Deletedwasn't wrong.
I still think Terminating is better than Deleted for "my TerminateInstances call just went through". But yeah, it's splitting a bit of a hair. If you feel like I'm being overly picky, I can shift things back to Deleted.
a128b43 to
6af061a
Compare
6af061a to
aea4d3d
Compare
Fix races like [1]: 1. The installer removes a bucket 2. The still-running registry operator tries to self-heal and creates a new bucket, but before it can push tags... 3. The installer terminates the instance where the registry operator was running. 4. The installer leaks the new, untagged bucket. This has been happening in CI in jobs like [2,3], where CloudFormation logs show: 1. 12:56:32, registry operator creates the bucket 2. 14:31:10, installer deletes the bucket 3. 14:31:11, registry operator creates the bucket again 4. 14:31:11, installer starts requesting instance termination And it could theoretically happen to any resource (not just buckets) which we discover by tag, but which is not tagged atomically on creation. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1740933 [2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/232/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade/454 [3]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/232/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade/454/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Deleted arn=.*\(instance\|s3\)' time="2019-08-06T14:31:10Z" level=info msg=Deleted arn="arn:aws:s3:::image-registry-us-east-1-ciopjl76tqd5ff81b2fw4x-9d77aba0b84911" time="2019-08-06T14:31:11Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-042a8a77afcf9ab69" id=i-042a8a77afcf9ab69 time="2019-08-06T14:31:11Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-08e03cff5a3b35802" id=i-08e03cff5a3b35802 time="2019-08-06T14:31:13Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0abc61b48d7bdb810" id=i-0abc61b48d7bdb810 time="2019-08-06T14:31:17Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0a1a89a91ab4dbcb0" id=i-0a1a89a91ab4dbcb0 time="2019-08-06T14:31:23Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0c4c6241ea5ecccec" id=i-0c4c6241ea5ecccec time="2019-08-06T14:31:23Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-003abf1d84ecbea9f" id=i-003abf1d84ecbea9f
We might want this same sort of thing in deleteEC2NATGatewaysByVPC and
our other functions where the caller doesn't know what resources the
function might be deleting. But at the moment, Abhinav is only asking
for a preemptive guard for doubled instance investigation [1].
The ARN syntax for instances is [2]:
arn:aws:ec2:{region}:{account-id}:instance/{instance-id}
and while we have to stretch into the ec2Client and reservation
objects to find the region and account ID, constructing the ARN
ourselves isn't that bad.
Without this guard, any instances that terminateEC2InstancesByTags
terminates but which still show up via the generic tag search will
flow down through Run's deleteARN, through terminateEC2Instance, and
end up in terminateEC2InstanceByInstance where terminated instances
are silently skipped. So it doesn't save us all that much to return
terminated instance ARNs, just the DescribeInstances call in
terminateEC2Instance.
[1]: openshift#2169 (comment)
[2]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-ec2
aea4d3d to
54cfed5
Compare
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
2 similar comments
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1740933 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@wking: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Partitions are more complicated than I thought when I wrote the hard-coded arn:aws:ec2:... prefix in 54cfed5 (pkg/destroy/aws/aws: Return ARNs from terminateEC2InstancesByTags, 2019-08-13, openshift#2169). While most of our regions use 'aws', China uses aws-cn [1], GovCloud uses aws-us-gov [2], and there may be more. Pull this ARN componet from the AWS SDK instead to track that sort of thing. The Gopkg.lock bump was generated with: $ dep ensure using: $ dep version dep: version : v0.5.1 build date : 2019-03-20 git hash : faa61893 go version : go1.10.3 go compiler : gc platform : linux/amd64 features : ImportDuringSolve=false [1]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arns-syntax [2]: https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/#pkg-overview
I'd added this to save on AWS API-call volume in fc70eb2 (pkg/destroy/aws: Skip direct tag deletion for network interfaces, 2019-05-02, openshift#1704). But now that we support user-provided VPCs with installer-provided network interfaces, it's no longer feasible. Happily, the tagged network interfaces are just for installer-created control-plane machines, see 4fcdc1c (data/aws: explicitly create the network interface for masters to speed up DNS, 2019-03-04, openshift#1361). And since cf69c1e (pkg/destroy/aws: Terminate instances before other resources, 2019-08-06, openshift#2169) we are terminating instances first and waiting for their termination to complete before proceeding to other resources. So we should no longer have the long runs of failed network-interface deletion attempts while we wait for the instances to terminate, and this commit should make our removal more robust without increasing our API-call volume.
We might want this same sort of thing in deleteEC2NATGatewaysByVPC and
our other functions where the caller doesn't know what resources the
function might be deleting. But at the moment, Abhinav is only asking
for a preemptive guard for doubled instance investigation [1].
The ARN syntax for instances is [2]:
arn:aws:ec2:{region}:{account-id}:instance/{instance-id}
and while we have to stretch into the ec2Client and reservation
objects to find the region and account ID, constructing the ARN
ourselves isn't that bad.
Without this guard, any instances that terminateEC2InstancesByTags
terminates but which still show up via the generic tag search will
flow down through Run's deleteARN, through terminateEC2Instance, and
end up in terminateEC2InstanceByInstance where terminated instances
are silently skipped. So it doesn't save us all that much to return
terminated instance ARNs, just the DescribeInstances call in
terminateEC2Instance.
[1]: openshift#2169 (comment)
[2]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-ec2
Partitions are more complicated than I thought when I wrote the hard-coded arn:aws:ec2:... prefix in 54cfed5 (pkg/destroy/aws/aws: Return ARNs from terminateEC2InstancesByTags, 2019-08-13, openshift#2169). While most of our regions use 'aws', China uses aws-cn [1], GovCloud uses aws-us-gov [2], and there may be more. Pull this ARN componet from the AWS SDK instead to track that sort of thing. The Gopkg.lock bump was generated with: $ dep ensure using: $ dep version dep: version : v0.5.1 build date : 2019-03-20 git hash : faa61893 go version : go1.10.3 go compiler : gc platform : linux/amd64 features : ImportDuringSolve=false [1]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arns-syntax [2]: https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/#pkg-overview
I'd added this to save on AWS API-call volume in fc70eb2 (pkg/destroy/aws: Skip direct tag deletion for network interfaces, 2019-05-02, openshift#1704). But now that we support user-provided VPCs with installer-provided network interfaces, it's no longer feasible. Happily, the tagged network interfaces are just for installer-created control-plane machines, see 4fcdc1c (data/aws: explicitly create the network interface for masters to speed up DNS, 2019-03-04, openshift#1361). And since cf69c1e (pkg/destroy/aws: Terminate instances before other resources, 2019-08-06, openshift#2169) we are terminating instances first and waiting for their termination to complete before proceeding to other resources. So we should no longer have the long runs of failed network-interface deletion attempts while we wait for the instances to terminate, and this commit should make our removal more robust without increasing our API-call volume.
Fix races like:
This has been happening in CI in jobs like this one, where CloudFormation logs show:
And it could theoretically happen to any resource (not just buckets) which we discover by tag, but which is not tagged atomically on creation.