fix: wait for NAT Gateway available state before creating routes by ArangoGutierrez · Pull Request #735 · NVIDIA/holodeck

ArangoGutierrez · 2026-03-18T10:21:22Z

Summary

Bug 1 (NAT GW race): CreateNatGateway returns immediately with a pending NAT Gateway ID. The subsequent createPrivateRouteTable call uses this ID in CreateRoute, which fails with InvalidNatGatewayID.NotFound because the NAT GW hasn't propagated yet.
Fix 1: Poll DescribeNatGateways after creation until state transitions to available (or failed), matching the existing pattern in deleteNATGateway.
Bug 2 (SG cleanup race): After instance termination, ENIs linger for 2-5 minutes in in-use state. DeleteSecurityGroup fails with DependencyViolation because ENIs still reference the SG. Additionally, cross-SG references (worker SG → CP SG) prevent deletion.
Fix 2: (a) Add ENI drain phase — poll DescribeNetworkInterfaces until all ENIs are detached (5 min timeout). (b) Revoke all SG ingress/egress rules before deletion to break cross-references. (c) Increase retry budget to 10 retries with 10s initial backoff.
Architecture change: Place cluster instances in public subnet with public IPs (avoids EIP quota exhaustion from NAT GW).

Changes

pkg/provider/aws/create.go: Add wait loop in createNATGateway polling for available state (60 × 5s = 5 min max)
pkg/provider/aws/cluster.go: Place instances in public subnet, skip NAT GW creation in cluster path
pkg/provider/aws/delete.go: Add waitForENIsDrained(), revokeSecurityGroupRules(), increase retry budget
pkg/provider/aws/aws.go: Add RevokeSecurityGroupIngress/Egress to EC2Client interface
pkg/provider/aws/create_test.go: Fix mock to return NAT GW in available state
pkg/provider/aws/cluster_test.go: Update tests for public subnet placement, add NAT GW wait + failed state tests

Test plan

Unit tests pass: go test ./pkg/provider/aws/ (84 Ginkgo + Go tests)
E2E cluster tests pass (all flavors: default, legacy, dra, kernel, rpm-rocky, rpm-al2023, rpm-fedora)

coveralls · 2026-03-18T11:39:51Z

Pull Request Test Coverage Report for Build 23486138632

Details

93 of 126 (73.81%) changed or added relevant lines in 4 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.5%) to 56.313%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/provider/aws/create.go	22	26	84.62%
pkg/provider/aws/cluster.go	2	8	25.0%
pkg/testutil/mocks/aws.go	0	6	0.0%
pkg/provider/aws/delete.go	69	86	80.23%

Totals
Change from base Build 23291519604:	0.5%
Covered Lines:	3724
Relevant Lines:	6613

💛 - Coveralls

…utes CreateNatGateway returns immediately with a pending NAT Gateway ID. Using this ID in CreateRoute before the NAT GW is available causes InvalidNatGatewayID.NotFound errors, breaking all cluster deployments. Poll DescribeNatGateways until state transitions to available (or failed) before returning from createNATGateway. This matches the existing pattern in deleteNATGateway which polls for deleted state. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

In the production cluster topology all nodes are in the private subnet and do not have public IPs. The E2E test was asserting PublicIP is non-empty for all nodes, which fails by design. Only PrivateIP is guaranteed. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Cluster nodes in private subnets have no public IP, so SSH connections must go through SSM port-forwarding. The E2E test and GetClusterHealth were building NodeInfo without setting the Transport field, causing SSH to attempt direct connection to an empty host. Wire SSMTransport for nodes with no PublicIP, matching the production code path in buildClusterNodeInfoList. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

The private subnet topology (PR NVIDIA#728) moved cluster instances behind a NAT GW, but SSM infrastructure (IAM instance profiles) was never set up, making SSH connections impossible. Revert instances to the public subnet with AssociatePublicIpAddress=true so they are directly reachable via SSH. The NAT GW + NLB infrastructure remains in place for future SSM migration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Cluster instances are placed in the public subnet with AssociatePublicIpAddress=true, so they reach the internet directly through the IGW. The NAT Gateway (and its EIP allocation + private route table) are unnecessary for this topology. Removing them from the cluster code path frees up EIP quota (AWS limit: 5 per region in us-west-1), which was being exhausted when multiple CI jobs ran concurrently, causing AddressLimitExceeded errors on the rpm-rocky and rpm-al2023 E2E tests. The private subnet is retained for future SSM endpoint use. Single-node mode (create.go) is unchanged and continues to use NAT Gateway for its private-subnet instances. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

…face Extend the EC2Client interface with RevokeSecurityGroupIngress and RevokeSecurityGroupEgress methods needed for proper SG cleanup. Update both mock implementations to satisfy the interface. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

…tion Before calling DeleteSecurityGroup, enumerate and revoke all ingress and egress rules. This breaks cross-SG references (e.g., worker SG referencing control-plane SG) that cause DependencyViolation errors. Gracefully handles already-deleted groups (InvalidGroup.NotFound). Revoke errors are logged as warnings and do not block the delete attempt. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

After instance termination, AWS ENIs can linger for 2-5 minutes in 'in-use' state. Attempting to delete security groups while ENIs reference them causes DependencyViolation errors. Add waitForENIsDrained() that polls DescribeNetworkInterfaces every 10s (up to 5min) until all non-available ENIs in the VPC are gone. Insert this as Phase 1.5 between instance termination and SG deletion. ENI drain timeout is a warning, not a fatal error — the extended retry budget on deleteSecurityGroup provides a second line of defense. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

…lays Increase maxRetries from 5 to 10 and retryDelay from 5s to 10s. This extends the total retry budget from ~65s to ~5min, covering the full AWS ENI detach window (2-5 min) as a final safety net beyond the explicit ENI drain wait and SG rule revocation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

The EC2Client interface gained RevokeSecurityGroupIngress and RevokeSecurityGroupEgress in commit fa13fef, but the shared mock in pkg/testutil/mocks was not updated. This broke pkg/cleanup tests which use the shared mock. Also fix pre-existing gofmt issue in mock_ec2_test.go. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

…ore deletion Security group deletion failed with DependencyViolation because worker SG ingress rules referenced CP SG and vice versa. Fix: revoke both SGs' ingress rules upfront before any deletion. Egress revocation is skipped because CI IAM user lacks ec2:RevokeSecurityGroupEgress permission (and the default egress rule doesn't create cross-SG dependencies). Route table deletion also failed with DependencyViolation because subnet associations were not removed first. Fix: describe route table associations and disassociate non-main entries before calling DeleteRouteTable. Both fixes validated with local E2E: create + delete of a minimal cluster completed cleanly with no leaked resources. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

TestRevokeSecurityGroupRules_RevokesIngressAndEgress asserted that egress revocation was called, but the implementation now intentionally skips it (CI IAM lacks ec2:RevokeSecurityGroupEgress). Updated test to assert 0 egress calls and renamed to TestRevokeSecurityGroupRules_RevokesIngressOnly. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

…tion CI IAM user (cnt-ci) lacks ec2:DisassociateRouteTable, so the explicit disassociation approach fails with 403. Instead, swap the deletion order: delete subnets first (which implicitly removes route table associations), then delete route tables. This avoids needing DisassociateRouteTable entirely. Remove the disassociateRouteTableSubnets helper and DisassociateRouteTable from the EC2Client interface since they are no longer needed. Validated with local E2E: create + delete completes cleanly. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

ArangoGutierrez added 5 commits March 19, 2026 13:39

ArangoGutierrez force-pushed the fix/nat-gateway-wait-available branch from 0ee88ca to 7b7b7ed Compare March 19, 2026 12:40

ArangoGutierrez added 8 commits March 19, 2026 18:28

ArangoGutierrez merged commit 1963bf2 into NVIDIA:main Mar 24, 2026
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: wait for NAT Gateway available state before creating routes#735

fix: wait for NAT Gateway available state before creating routes#735
ArangoGutierrez merged 13 commits intoNVIDIA:mainfrom
ArangoGutierrez:fix/nat-gateway-wait-available

ArangoGutierrez commented Mar 18, 2026 •

edited

Loading

Uh oh!

coveralls commented Mar 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ArangoGutierrez commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

coveralls commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 23486138632

Details

💛 - Coveralls

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ArangoGutierrez commented Mar 18, 2026 •

edited

Loading

coveralls commented Mar 18, 2026 •

edited

Loading