Document AgentCore ENI cleanup workflow for stack destroy/redeploy #111

@scoropeza

Description

Follow-up from PR #88 — repeatedly hit during deploy/destroy cycles on the dev stack; cost a session of on-call time once.

Functional description

When you destroy an ABCA stack (cdk destroy backgroundagent-dev), CloudFormation removes the BedrockAgentCore::Runtime resource and reports success. AgentCore continues to hold ENIs (Elastic Network Interfaces) in the stack's VPC for up to 8 hours afterward, even though the runtime resource itself is gone. Attempting to destroy the dependent VPC, subnets, or security groups during that window fails with DependencyViolation.

Today this is not documented anywhere in the public ABCA docs. The first time a team hits it, they're staring at a CloudFormation stack stuck in DELETE_FAILED with no obvious recovery, no AWS-side warning at create time, and no way to expedite the ENI cleanup. The only workaround that has worked for us is to retain the dependent VPC/SG/subnets via delete-stack --retain-resources and clean them up with a tag-driven sweep once the ENIs are eventually released.

User-visible impact:

  • cdk destroy reports success for BedrockAgentCore::Runtime, then fails on the next dependent resource with DependencyViolation.
  • Stack ends up in DELETE_FAILED; operator must intervene manually with no documented procedure.
  • Re-running cdk deploy on a fresh stack name works (account is not blocked) but the orphaned VPC/SG/subnets continue to accrue NAT Gateway / VPC Endpoint costs while waiting for the ENI lease to expire.
  • For a team destroying and redeploying multiple times during dev (e.g. testing IAM changes), this compounds: a single trip to AWS Support (because the operator didn't know to wait 8h) is enough to lose half a day of work.

Technical root cause

Why AgentCore behaves this way: BedrockAgentCore::Runtime provisions ENIs in the operator's VPC for outbound traffic during runtime invocations. When the runtime is deleted, the AgentCore service holds the ENI lease for ~8h "for warm restart purposes", and there is no way to force-release the ENIs via the public API. AWS Support can release them on request, but turnaround is 1-2 business days.
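To confirm that lingering ENIs are what is blocking the delete, it helps to list every network interface still present in the stack's VPC. The sketch below is a minimal, hedged example: the function name lingering_enis is hypothetical, and because this issue does not state how AgentCore labels its ENIs, it lists all ENIs in the VPC rather than filtering by description. The input follows the shape of the EC2 DescribeNetworkInterfaces response.

```python
from typing import Any


def lingering_enis(response: dict[str, Any], vpc_id: str) -> list[dict[str, str]]:
    """Summarise the ENIs that belong to `vpc_id`.

    `response` is expected to look like the output of EC2
    DescribeNetworkInterfaces, e.g. from boto3:
        boto3.client("ec2").describe_network_interfaces(
            Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
    """
    found = []
    for eni in response.get("NetworkInterfaces", []):
        if eni.get("VpcId") != vpc_id:
            continue  # defensive: ignore ENIs from other VPCs
        found.append({
            "id": eni["NetworkInterfaceId"],
            "status": eni.get("Status", "unknown"),
            # Assumption: the description may hint at the owning service,
            # but the exact text AgentCore uses is not documented here.
            "description": eni.get("Description", ""),
        })
    return found


# Illustrative sample data only; IDs are made up.
sample = {
    "NetworkInterfaces": [
        {"NetworkInterfaceId": "eni-0aaa", "VpcId": "vpc-123", "Status": "in-use"},
        {"NetworkInterfaceId": "eni-0bbb", "VpcId": "vpc-999", "Status": "available"},
    ]
}
for eni in lingering_enis(sample, "vpc-123"):
    print(eni["id"], eni["status"], eni["description"])
```

An empty result for the VPC suggests the ENIs have been released and the dependent resources can be deleted.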

Where this surfaces:

  • cdk/src/constructs/agent-vpc.ts — defines the VPC; doesn't mark the SGs as RemovalPolicy.RETAIN.
  • cdk/src/stacks/agent.ts — the BedrockAgentCore::Runtime is the resource that creates the ENIs.
  • feedback_agentcore_eni_cleanup_workflow.md (internal session memory) — the workaround pattern documented here.

The workaround pattern (proven on backgroundagent-dev, account 169728770098):

  1. Before destroy, tag every retainable VPC/SG/subnet with:
    • cleanup:reason = "agentcore-eni-pending-release"
    • cleanup:safe-to-delete-after = ISO timestamp 8h from now
    • cleanup:related-enis = comma-separated ENI IDs
    • cleanup:original-stack = stack name
    • cleanup:owner = email or team
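  2. Once the stack is in DELETE_FAILED, run aws cloudformation delete-stack --stack-name <name> --retain-resources <VpcLogicalId> <SgLogicalId> <Subnet1LogicalId> <Subnet2LogicalId> (the flag takes space-separated logical resource IDs, not physical IDs, and CloudFormation accepts it only for a stack that is already in DELETE_FAILED).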
  3. After the safe-to-delete-after timestamp passes, run a sweep script that lists tagged resources and deletes the ones whose ENIs are gone.

This works but every team has to derive it from scratch.
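As a starting point for teams adopting the tag schema from step 1, here is a minimal sketch that builds the five tags in the Key/Value list form the EC2 CreateTags API expects. The function name cleanup_tags and the ENI_HOLD constant are hypothetical; the 8-hour window is the hold time reported in this issue, not a documented AWS guarantee.

```python
from datetime import datetime, timedelta, timezone

# Assumption: hold window as observed in this issue.
ENI_HOLD = timedelta(hours=8)


def cleanup_tags(stack_name, owner, eni_ids, now=None):
    """Build the cleanup:* tags for one retained VPC/SG/subnet."""
    now = now or datetime.now(timezone.utc)
    safe_after = (now + ENI_HOLD).replace(microsecond=0)
    return [
        {"Key": "cleanup:reason", "Value": "agentcore-eni-pending-release"},
        {"Key": "cleanup:safe-to-delete-after", "Value": safe_after.isoformat()},
        # EC2 tag values are capped at 256 characters, so a long ENI list
        # may need to be truncated or split across resources.
        {"Key": "cleanup:related-enis", "Value": ",".join(eni_ids)},
        {"Key": "cleanup:original-stack", "Value": stack_name},
        {"Key": "cleanup:owner", "Value": owner},
    ]


# Illustrative values only.
tags = cleanup_tags(
    "backgroundagent-dev",
    "team@example.com",
    ["eni-0aaa", "eni-0bbb"],
    now=datetime(2025, 1, 1, 0, 0, tzinfo=timezone.utc),
)
for tag in tags:
    print(tag["Key"], "=", tag["Value"])
```

With boto3, the result could be applied via ec2.create_tags(Resources=[vpc_id, sg_id, ...], Tags=tags) before starting the destroy.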

Proposed options

Recommend option A (documentation + tag schema + sweep script) as the lightweight first step. Option B (CDK construct) is a follow-on if the pattern proves common.

Option A — documentation + sweep script:

  • Add docs/operations/agentcore-eni-cleanup.md documenting the pattern.
  • Add scripts/sweep-agentcore-orphans.sh that finds tagged orphans, checks ENI status, and reports/deletes.
  • Cross-link from the developer guide and the destroy section of the README.
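The core of the sweep in option A is a per-resource decision: delete only when the tagged timestamp has passed and none of the recorded ENIs still exist. The sketch below shows one way that check could look. The function name sweep_decision and its three return values are hypothetical, and it assumes the tags follow the cleanup:* schema from the workaround above.

```python
from datetime import datetime, timezone


def sweep_decision(tags: dict[str, str], live_eni_ids: set[str], now: datetime) -> str:
    """Return 'delete', 'wait', or 'skip' for one tagged resource.

    `tags` maps tag key to value; `live_eni_ids` holds the ENI IDs that
    still exist; `now` must be timezone-aware.
    """
    if tags.get("cleanup:reason") != "agentcore-eni-pending-release":
        return "skip"  # not part of this workflow
    raw = tags.get("cleanup:safe-to-delete-after", "")
    try:
        # Older Python versions do not parse a trailing "Z", so normalise it.
        safe_after = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return "skip"  # missing or malformed timestamp: leave for a human
    if safe_after.tzinfo is None:
        safe_after = safe_after.replace(tzinfo=timezone.utc)  # assume UTC
    if now < safe_after:
        return "wait"
    related = {e for e in tags.get("cleanup:related-enis", "").split(",") if e}
    if related & live_eni_ids:
        return "wait"  # at least one recorded ENI has not been released
    return "delete"


# Illustrative values only.
example_tags = {
    "cleanup:reason": "agentcore-eni-pending-release",
    "cleanup:safe-to-delete-after": "2025-01-01T08:00:00+00:00",
    "cleanup:related-enis": "eni-0aaa,eni-0bbb",
}
print(sweep_decision(example_tags, {"eni-0bbb"},
                     datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)))
```

Checking both the timestamp and the live ENI set keeps the sweep safe if the hold runs longer than expected; malformed tags are skipped rather than deleted so that an operator can review them.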

Option B — CDK construct:

  • New AgentCoreEniAwareDestroy construct that auto-tags the VPC/SG/subnets at deploy time and registers a Lambda-based custom resource that runs the cleanup sweep on stack destroy.
  • More invasive; opt-in per stack.

Option C — feature request to AWS:

  • File an AWS support case asking for either (a) a ForceReleaseENIs API on AgentCore or (b) a deploy-time option that marks the runtime's ENIs as ephemeral. This has a long lead time but is the right structural fix.

Acceptance criteria

  • docs/operations/agentcore-eni-cleanup.md documents detection, the tag schema, the destroy-with-retain procedure, and the sweep flow
  • Either scripts/sweep-agentcore-orphans.sh exists OR the doc references the manual aws ec2 describe-network-interfaces + aws ec2 delete-vpc chain
  • Cross-linked from docs/guides/DEVELOPER_GUIDE.md (search for "destroy")
  • README.md gets a "Known issues" section pointing here
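If the doc takes the manual route instead of shipping the script, the command chain might be laid out roughly as follows. This is a sketch only: the helper name manual_cleanup_commands is hypothetical, it prints commands rather than running them, and it covers just the VPC, security groups, and subnets named in this issue. A real VPC may also have NAT Gateways, VPC Endpoints, route tables, or an internet gateway that must be removed before delete-vpc succeeds.

```python
import shlex


def manual_cleanup_commands(vpc_id, security_group_ids, subnet_ids):
    """Return the AWS CLI commands for the manual chain, in order."""
    cmds = [[
        # Step 1: confirm no ENIs remain in the VPC (expect an empty list).
        "aws", "ec2", "describe-network-interfaces",
        "--filters", f"Name=vpc-id,Values={vpc_id}",
        "--query", "NetworkInterfaces[].NetworkInterfaceId",
    ]]
    # Step 2: remove dependents before the VPC itself.
    cmds += [["aws", "ec2", "delete-subnet", "--subnet-id", s] for s in subnet_ids]
    cmds += [["aws", "ec2", "delete-security-group", "--group-id", g]
             for g in security_group_ids]
    # Step 3: the VPC goes last.
    cmds.append(["aws", "ec2", "delete-vpc", "--vpc-id", vpc_id])
    return [shlex.join(c) for c in cmds]


# Illustrative IDs only.
commands = manual_cleanup_commands("vpc-123", ["sg-111"], ["subnet-a", "subnet-b"])
for line in commands:
    print(line)
```

Only proceed past the first command when it returns no ENI IDs; otherwise the deletes fail with the same DependencyViolation.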

Out of scope

  • Building the option B CDK construct (that's a separate enhancement issue).
  • Filing the AWS feature request (separate; track in the issue body but don't block on it).
  • Sweep automation that runs on a schedule — operator-driven is fine for now.

References

  • feedback_agentcore_eni_cleanup_workflow.md (internal session memory; tag schema came from here)
  • cdk/src/constructs/agent-vpc.ts (current VPC construct, no retention marking)
  • AWS Bedrock AgentCore documentation: https://docs.aws.amazon.com/bedrock-agentcore/ (no public mention of ENI lease semantics)
