Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Aug 6, 2019

Fix races like:

  1. The installer removes a bucket
  2. The still-running registry operator tries to self-heal and creates a new bucket, but before it can push tags...
  3. The installer terminates the instance where the registry operator was running.
  4. The installer leaks the new, untagged bucket.

This has been happening in CI in jobs like this one, where CloudFormation logs show:

  1. 12:56:32, registry operator creates the bucket
  2. 14:31:10, installer deletes the bucket
  3. 14:31:11, registry operator creates the bucket again
  4. 14:31:11, installer starts requesting instance termination

And it could theoretically happen to any resource (not just buckets) which we discover by tag, but which is not tagged atomically on creation.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2019
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 6, 2019
@abhinavdahiya
Copy link
Contributor

Fix races like:

1. The installer removes a bucket

2. The still-running registry operator tries to self-heal and creates a new bucket, but before it can push tags...

3. The installer terminates the instance where the registry operator was running.

4. The installer leaks the new, untagged bucket.

is the resourcetaggingapi lagging wrt to newly creately objects? because otherwise the sync loop doesn't stop unless it see no resources...
^^ in which case there shouldn't be a leak...

or is th registry operator creating new buckets without tags when it recreates.

@abhinavdahiya
Copy link
Contributor

I believe it might be more to to with this check skipping resources that were previously deleted and recreated but had same ARN..
https://github.com/openshift/installer/blob/449039508ab174d9998ea63e8d3b62f9d4614894/pkg/destroy/aws/aws.go#L173

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resourcegrouptaggingspi loop already supports resource type filters.. I think it would be better if we use less number of loops that do or filtering themselves.. let changes of error.

Also this doesn't go into deleted so the resourcegrouptaggingspi loop might try to send deletes good these again..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resourcegrouptaggingspi loop already supports resource type filters..

Sharing code is great, but running DescribeInstancesPages in deleteEC2InstancesByTags saves us from calling DescribeInstances in deleteEC2Instance for each returned instance. Did you have other ideas for restructuring deleteEC2InstancesByTags to leverage existing code?

Also this doesn't go into deleted so the resourcegrouptaggingspi loop might try to send deletes good these again.

I've added 53a8954e36 to this PR to address that. But as I point out in that commit message, I think all that does is save us a DescribeInstances call (still useful). And if we want it for deleteEC2InstancesByTags, we might also want it for deleteEC2NATGatewaysByVPC and similar. It's probably a per-pathway call about whether saving a delete-gating describe call is worth the cost of plumbing through the deleted-ARNs data.

@wking wking force-pushed the delete-aws-instances-first branch from 53a8954 to 44d8d02 Compare August 13, 2019 21:35
@wking
Copy link
Member Author

wking commented Aug 13, 2019

/retitle Bug 1740933: pkg/destroy/aws: Terminate instances before other resources

@openshift-ci-robot openshift-ci-robot changed the title pkg/destroy/aws: Terminate instances before other resources Bug 1740933: pkg/destroy/aws: Terminate instances before other resources Aug 13, 2019
@openshift-ci-robot
Copy link
Contributor

@wking: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Bug 1740933: pkg/destroy/aws: Terminate instances before other resources

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 13, 2019
@wking wking force-pushed the delete-aws-instances-first branch from 44d8d02 to 540872e Compare August 13, 2019 21:42
@wking
Copy link
Member Author

wking commented Aug 13, 2019

I believe it might be more to to with this check skipping resources that were previously deleted and recreated but had same ARN

That's a separate issue because in this case the replacement bucket was never tagged as cluster-owned. I've filed rhbz#1740935 to track it, and tried to distinguish more clearly between that and rhbz#1740933 which I'm targeting in this PR.

@sdodson
Copy link
Member

sdodson commented Aug 29, 2019

/retest

@wking wking force-pushed the delete-aws-instances-first branch from 540872e to a128b43 Compare August 30, 2019 21:04
@wking
Copy link
Member Author

wking commented Aug 30, 2019

Rebased onto master and added a tag: prefix to the filter with 540872e04 -> a128b4350. That should resolve the hang in teardown.

@wking
Copy link
Member Author

wking commented Aug 30, 2019

e2e-aws passed. e2e-aws-upgrade is outstanding, but that doesn't exercise the code in this PR. So I think we're all good to go here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, don't we INFO only deleted stuff?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, don't we INFO only deleted stuff?

We used to info-log Deleted when a termination request was accepted. This log line preserves that approach with a better string, because you can't actually delete instances, and we aren't waiting around to see AWS eventually reap the terminated instances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but when do we log Terminated? You capture the terminated when ByTags, but if we get an instance by deleteARN you'll only see Terminating.. that's not ideal.

This log line preserves that approach with a better string, because you can't actually delete instances

but to be fair, the for all intents and purposes the instance will be delete at some time.. after the termination request was sent, so the Deleted wasn't wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log line preserves that approach with a better string, because you can't actually delete instances, and we aren't waiting around to see AWS eventually reap the terminated instances.

also in that case the deleteEC2InstanceByInstance isn't appropriate too..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also in that case the deleteEC2InstanceByInstance isn't appropriate too..

s/deleteEC2Instance/terminateEC2Instance across the file with 6af061afd -> 54cfed5.

but when do we log Terminated?

Never, with the deleteARN chain. I'm fine with that, because tracking these outside of terminateEC2InstancesByTags just so we can report the terminated state doesn't seem like it's worth the trouble to me. Do you think I need to work that in?

but to be fair, the for all intents and purposes the instance will be delete at some time...

Which is why I'm fine leaving things at Terminating when we don't have a reason to wait out the termination (like we do in terminateEC2InstancesByTags but which we don't with the deleteARN tree).

... after the termination request was sent, so the Deleted wasn't wrong.

I still think Terminating is better than Deleted for "my TerminateInstances call just went through". But yeah, it's splitting a bit of a hair. If you feel like I'm being overly picky, I can shift things back to Deleted.

@wking wking force-pushed the delete-aws-instances-first branch from a128b43 to 6af061a Compare September 4, 2019 04:10
@wking wking force-pushed the delete-aws-instances-first branch from 6af061a to aea4d3d Compare September 4, 2019 04:12
Fix races like [1]:

1. The installer removes a bucket
2. The still-running registry operator tries to self-heal and creates
   a new bucket, but before it can push tags...
3. The installer terminates the instance where the registry operator
   was running.
4. The installer leaks the new, untagged bucket.

This has been happening in CI in jobs like [2,3], where CloudFormation
logs show:

1. 12:56:32, registry operator creates the bucket
2. 14:31:10, installer deletes the bucket
3. 14:31:11, registry operator creates the bucket again
4. 14:31:11, installer starts requesting instance termination

And it could theoretically happen to any resource (not just buckets)
which we discover by tag, but which is not tagged atomically on
creation.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1740933
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/232/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade/454
[3]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/232/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade/454/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Deleted arn=.*\(instance\|s3\)'
     time="2019-08-06T14:31:10Z" level=info msg=Deleted arn="arn:aws:s3:::image-registry-us-east-1-ciopjl76tqd5ff81b2fw4x-9d77aba0b84911"
     time="2019-08-06T14:31:11Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-042a8a77afcf9ab69" id=i-042a8a77afcf9ab69
     time="2019-08-06T14:31:11Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-08e03cff5a3b35802" id=i-08e03cff5a3b35802
     time="2019-08-06T14:31:13Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0abc61b48d7bdb810" id=i-0abc61b48d7bdb810
     time="2019-08-06T14:31:17Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0a1a89a91ab4dbcb0" id=i-0a1a89a91ab4dbcb0
     time="2019-08-06T14:31:23Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-0c4c6241ea5ecccec" id=i-0c4c6241ea5ecccec
     time="2019-08-06T14:31:23Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:instance/i-003abf1d84ecbea9f" id=i-003abf1d84ecbea9f
We might want this same sort of thing in deleteEC2NATGatewaysByVPC and
our other functions where the caller doesn't know what resources the
function might be deleting.  But at the moment, Abhinav is only asking
for a preemptive guard for doubled instance investigation [1].

The ARN syntax for instances is [2]:

  arn:aws:ec2:{region}:{account-id}:instance/{instance-id}

and while we have to stretch into the ec2Client and reservation
objects to find the region and account ID, constructing the ARN
ourselves isn't that bad.

Without this guard, any instances that terminateEC2InstancesByTags
terminates but which still show up via the generic tag search will
flow down through Run's deleteARN, through terminateEC2Instance, and
end up in terminateEC2InstanceByInstance where terminated instances
are silently skipped.  So it doesn't save us all that much to return
terminated instance ARNs, just the DescribeInstances call in
terminateEC2Instance.

[1]: openshift#2169 (comment)
[2]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-ec2
@wking wking force-pushed the delete-aws-instances-first branch from aea4d3d to 54cfed5 Compare September 4, 2019 04:20
@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 4, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 03ac8c9 into openshift:master Sep 5, 2019
@openshift-ci-robot
Copy link
Contributor

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1740933 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1740933: pkg/destroy/aws: Terminate instances before other resources

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@wking: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 54cfed5 link /test e2e-libvirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking wking deleted the delete-aws-instances-first branch September 5, 2019 02:31
wking added a commit to wking/openshift-installer that referenced this pull request Oct 3, 2019
Partitions are more complicated than I thought when I wrote the
hard-coded arn:aws:ec2:... prefix in 54cfed5 (pkg/destroy/aws/aws:
Return ARNs from terminateEC2InstancesByTags, 2019-08-13, openshift#2169).
While most of our regions use 'aws', China uses aws-cn [1], GovCloud
uses aws-us-gov [2], and there may be more.  Pull this ARN componet
from the AWS SDK instead to track that sort of thing.

The Gopkg.lock bump was generated with:

  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.1
   build date  : 2019-03-20
   git hash    : faa61893
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

[1]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arns-syntax
[2]: https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/#pkg-overview
wking added a commit to wking/openshift-installer that referenced this pull request Oct 6, 2019
I'd added this to save on AWS API-call volume in fc70eb2
(pkg/destroy/aws: Skip direct tag deletion for network interfaces,
2019-05-02, openshift#1704).  But now that we support user-provided VPCs with
installer-provided network interfaces, it's no longer feasible.
Happily, the tagged network interfaces are just for installer-created
control-plane machines, see 4fcdc1c (data/aws: explicitly create
the network interface for masters to speed up DNS, 2019-03-04, openshift#1361).
And since cf69c1e (pkg/destroy/aws: Terminate instances before
other resources, 2019-08-06, openshift#2169) we are terminating instances first
and waiting for their termination to complete before proceeding to
other resources.  So we should no longer have the long runs of failed
network-interface deletion attempts while we wait for the instances to
terminate, and this commit should make our removal more robust without
increasing our API-call volume.
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
We might want this same sort of thing in deleteEC2NATGatewaysByVPC and
our other functions where the caller doesn't know what resources the
function might be deleting.  But at the moment, Abhinav is only asking
for a preemptive guard for doubled instance investigation [1].

The ARN syntax for instances is [2]:

  arn:aws:ec2:{region}:{account-id}:instance/{instance-id}

and while we have to stretch into the ec2Client and reservation
objects to find the region and account ID, constructing the ARN
ourselves isn't that bad.

Without this guard, any instances that terminateEC2InstancesByTags
terminates but which still show up via the generic tag search will
flow down through Run's deleteARN, through terminateEC2Instance, and
end up in terminateEC2InstanceByInstance where terminated instances
are silently skipped.  So it doesn't save us all that much to return
terminated instance ARNs, just the DescribeInstances call in
terminateEC2Instance.

[1]: openshift#2169 (comment)
[2]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-ec2
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
Partitions are more complicated than I thought when I wrote the
hard-coded arn:aws:ec2:... prefix in 54cfed5 (pkg/destroy/aws/aws:
Return ARNs from terminateEC2InstancesByTags, 2019-08-13, openshift#2169).
While most of our regions use 'aws', China uses aws-cn [1], GovCloud
uses aws-us-gov [2], and there may be more.  Pull this ARN componet
from the AWS SDK instead to track that sort of thing.

The Gopkg.lock bump was generated with:

  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.1
   build date  : 2019-03-20
   git hash    : faa61893
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

[1]: https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arns-syntax
[2]: https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/#pkg-overview
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
I'd added this to save on AWS API-call volume in fc70eb2
(pkg/destroy/aws: Skip direct tag deletion for network interfaces,
2019-05-02, openshift#1704).  But now that we support user-provided VPCs with
installer-provided network interfaces, it's no longer feasible.
Happily, the tagged network interfaces are just for installer-created
control-plane machines, see 4fcdc1c (data/aws: explicitly create
the network interface for masters to speed up DNS, 2019-03-04, openshift#1361).
And since cf69c1e (pkg/destroy/aws: Terminate instances before
other resources, 2019-08-06, openshift#2169) we are terminating instances first
and waiting for their termination to complete before proceeding to
other resources.  So we should no longer have the long runs of failed
network-interface deletion attempts while we wait for the instances to
terminate, and this commit should make our removal more robust without
increasing our API-call volume.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants