fix: wait for NAT Gateway available state before creating routes#735
Merged
ArangoGutierrez merged 13 commits intoNVIDIA:mainfrom Mar 24, 2026
Merged
Conversation
Pull Request Test Coverage Report for Build 23486138632Details
💛 - Coveralls |
…utes CreateNatGateway returns immediately with a pending NAT Gateway ID. Using this ID in CreateRoute before the NAT GW is available causes InvalidNatGatewayID.NotFound errors, breaking all cluster deployments. Poll DescribeNatGateways until state transitions to available (or failed) before returning from createNATGateway. This matches the existing pattern in deleteNATGateway which polls for deleted state. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
In the production cluster topology all nodes are in the private subnet and do not have public IPs. The E2E test was asserting PublicIP is non-empty for all nodes, which fails by design. Only PrivateIP is guaranteed. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Cluster nodes in private subnets have no public IP, so SSH connections must go through SSM port-forwarding. The E2E test and GetClusterHealth were building NodeInfo without setting the Transport field, causing SSH to attempt direct connection to an empty host. Wire SSMTransport for nodes with no PublicIP, matching the production code path in buildClusterNodeInfoList. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The private subnet topology (PR NVIDIA#728) moved cluster instances behind a NAT GW, but SSM infrastructure (IAM instance profiles) was never set up, making SSH connections impossible. Revert instances to the public subnet with AssociatePublicIpAddress=true so they are directly reachable via SSH. The NAT GW + NLB infrastructure remains in place for future SSM migration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Cluster instances are placed in the public subnet with AssociatePublicIpAddress=true, so they reach the internet directly through the IGW. The NAT Gateway (and its EIP allocation + private route table) are unnecessary for this topology. Removing them from the cluster code path frees up EIP quota (AWS limit: 5 per region in us-west-1), which was being exhausted when multiple CI jobs ran concurrently, causing AddressLimitExceeded errors on the rpm-rocky and rpm-al2023 E2E tests. The private subnet is retained for future SSM endpoint use. Single-node mode (create.go) is unchanged and continues to use NAT Gateway for its private-subnet instances. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
0ee88ca to
7b7b7ed
Compare
…face Extend the EC2Client interface with RevokeSecurityGroupIngress and RevokeSecurityGroupEgress methods needed for proper SG cleanup. Update both mock implementations to satisfy the interface. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…tion Before calling DeleteSecurityGroup, enumerate and revoke all ingress and egress rules. This breaks cross-SG references (e.g., worker SG referencing control-plane SG) that cause DependencyViolation errors. Gracefully handles already-deleted groups (InvalidGroup.NotFound). Revoke errors are logged as warnings and do not block the delete attempt. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
After instance termination, AWS ENIs can linger for 2-5 minutes in 'in-use' state. Attempting to delete security groups while ENIs reference them causes DependencyViolation errors. Add waitForENIsDrained() that polls DescribeNetworkInterfaces every 10s (up to 5min) until all non-available ENIs in the VPC are gone. Insert this as Phase 1.5 between instance termination and SG deletion. ENI drain timeout is a warning, not a fatal error — the extended retry budget on deleteSecurityGroup provides a second line of defense. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…lays Increase maxRetries from 5 to 10 and retryDelay from 5s to 10s. This extends the total retry budget from ~65s to ~5min, covering the full AWS ENI detach window (2-5 min) as a final safety net beyond the explicit ENI drain wait and SG rule revocation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The EC2Client interface gained RevokeSecurityGroupIngress and RevokeSecurityGroupEgress in commit fa13fef, but the shared mock in pkg/testutil/mocks was not updated. This broke pkg/cleanup tests which use the shared mock. Also fix pre-existing gofmt issue in mock_ec2_test.go. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…ore deletion Security group deletion failed with DependencyViolation because worker SG ingress rules referenced CP SG and vice versa. Fix: revoke both SGs' ingress rules upfront before any deletion. Egress revocation is skipped because CI IAM user lacks ec2:RevokeSecurityGroupEgress permission (and the default egress rule doesn't create cross-SG dependencies). Route table deletion also failed with DependencyViolation because subnet associations were not removed first. Fix: describe route table associations and disassociate non-main entries before calling DeleteRouteTable. Both fixes validated with local E2E: create + delete of a minimal cluster completed cleanly with no leaked resources. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
TestRevokeSecurityGroupRules_RevokesIngressAndEgress asserted that egress revocation was called, but the implementation now intentionally skips it (CI IAM lacks ec2:RevokeSecurityGroupEgress). Updated test to assert 0 egress calls and renamed to TestRevokeSecurityGroupRules_RevokesIngressOnly. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
…tion CI IAM user (cnt-ci) lacks ec2:DisassociateRouteTable, so the explicit disassociation approach fails with 403. Instead, swap the deletion order: delete subnets first (which implicitly removes route table associations), then delete route tables. This avoids needing DisassociateRouteTable entirely. Remove the disassociateRouteTableSubnets helper and DisassociateRouteTable from the EC2Client interface since they are no longer needed. Validated with local E2E: create + delete completes cleanly. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
CreateNatGatewayreturns immediately with apendingNAT Gateway ID. The subsequentcreatePrivateRouteTablecall uses this ID inCreateRoute, which fails withInvalidNatGatewayID.NotFoundbecause the NAT GW hasn't propagated yet.DescribeNatGatewaysafter creation until state transitions toavailable(orfailed), matching the existing pattern indeleteNATGateway.in-usestate.DeleteSecurityGroupfails withDependencyViolationbecause ENIs still reference the SG. Additionally, cross-SG references (worker SG → CP SG) prevent deletion.DescribeNetworkInterfacesuntil all ENIs are detached (5 min timeout). (b) Revoke all SG ingress/egress rules before deletion to break cross-references. (c) Increase retry budget to 10 retries with 10s initial backoff.Changes
pkg/provider/aws/create.go: Add wait loop increateNATGatewaypolling foravailablestate (60 × 5s = 5 min max)pkg/provider/aws/cluster.go: Place instances in public subnet, skip NAT GW creation in cluster pathpkg/provider/aws/delete.go: AddwaitForENIsDrained(),revokeSecurityGroupRules(), increase retry budgetpkg/provider/aws/aws.go: AddRevokeSecurityGroupIngress/Egressto EC2Client interfacepkg/provider/aws/create_test.go: Fix mock to return NAT GW inavailablestatepkg/provider/aws/cluster_test.go: Update tests for public subnet placement, add NAT GW wait + failed state testsTest plan
go test ./pkg/provider/aws/(84 Ginkgo + Go tests)