fix: harden error handling and add SSH keepalive for v0.3.4 #772
Merged
ArangoGutierrez merged 8 commits into NVIDIA:main on Apr 1, 2026
The detach retry block in deleteInternetGateway only checked for Gateway.NotAttached. When the IGW was already deleted, DetachInternetGateway returned InvalidInternetGatewayID.NotFound, which was retried up to maxRetries times. Now both conditions are treated as success via the new isAlreadyDetachedError helper.

Ref: NVIDIA#771
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
deleteNLB, deleteListener, and deleteTargetGroup had no NotFound error handling. If any resource was already deleted, errors would propagate instead of being treated as success. Added helper functions isNLBNotFoundError, isTargetGroupNotFoundError, and isListenerNotFoundError and applied them to all delete and describe paths.

Ref: NVIDIA#771
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Long-running remote commands (kubeadm init, ~10-20 min) were failing with ExitMissingError because network middleboxes drop idle TCP connections.

Added:
- 15s handshake timeout on ssh.ClientConfig.Timeout (for ssh.Dial)
- 15s deadline on net.Conn before ssh.NewClientConn (for transport path)
- 30s keepalive interval via SSH global requests (startKeepalive)

The keepalive goroutine self-terminates when the client is closed.

Ref: NVIDIA#771
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
When the periodic cleanup job encounters an IGW that is already gone (InvalidInternetGatewayID.NotFound) or already detached (Gateway.NotAttached), the warning is now silently skipped instead of logging a misleading failure message.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull Request Test Coverage Report for Build 23855746299 (Coveralls)
If the NLB is deleted between cache population and the describe call in deleteNLBForCluster, the LoadBalancerNotFound error is now treated as success instead of propagating as a hard error.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Extends the NotFound suppression from deleteInternetGateways to all cleanup delete functions:
- deleteSecurityGroups: InvalidGroup.NotFound
- deleteSubnets: InvalidSubnetID.NotFound
- deleteRouteTables: InvalidRouteTableID.NotFound

When a resource is already gone, the warning is now silently skipped instead of logging a misleading failure message.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Summary
Fixes the E2E failure reported in #771 by addressing the root causes: SSH session drops during long provisioning operations and NotFound errors during resource cleanup.
Changes
1. IGW Detach NotFound (pkg/provider/aws/delete.go)
When an Internet Gateway is already deleted, the detach step now recognizes InvalidInternetGatewayID.NotFound alongside Gateway.NotAttached and skips retries instead of looping maxRetries times.
2. NLB NotFound Handling (pkg/provider/aws/nlb.go)
All NLB cleanup paths (deleteNLB, deleteListener, deleteTargetGroup, deleteNLBForCluster) now check for LoadBalancerNotFound, ListenerNotFound, and TargetGroupNotFound before retrying, treating already-deleted resources as success.
3. SSH Keepalive & Handshake Timeout (pkg/provisioner/provisioner.go)
- Sends keepalive@holodeck probes every 30s to prevent session drops during long operations (e.g., kubeadm init).
- Sets a 15s deadline on the net.Conn to prevent ssh.NewClientConn from blocking indefinitely against hosts that accept TCP but never complete the SSH handshake (ssh.ClientConfig.Timeout only applies to ssh.Dial, not the transport path).
4. Cleanup NotFound Warnings (pkg/cleanup/cleanup.go)
The periodic cleanup job no longer logs misleading "Failed to detach/delete internet gateway" warnings when an IGW is already gone.
5. Version Bump to v0.3.4
Tests Added
- pkg/provider/aws/delete_igw_test.go: 3 tests (NotFound, NotAttached, real error retries)
- pkg/provider/aws/nlb_delete_test.go: 6 tests (NLB/listener/target-group NotFound + real errors)
- pkg/provisioner/ssh_config_test.go: handshake timeout test using black-hole TCP server
- pkg/cleanup/cleanup_ginkgo_test.go: IGW NotFound completion test

Closes #771