Skip to content

fix: harden error handling and add SSH keepalive for v0.3.4#772

Merged
ArangoGutierrez merged 8 commits intoNVIDIA:mainfrom
ArangoGutierrez:fix/igw-detach-notfound
Apr 1, 2026
Merged

fix: harden error handling and add SSH keepalive for v0.3.4#772
ArangoGutierrez merged 8 commits intoNVIDIA:mainfrom
ArangoGutierrez:fix/igw-detach-notfound

Conversation

@ArangoGutierrez
Copy link
Copy Markdown
Collaborator

Summary

Fixes the E2E failure reported in #771 by addressing the root causes: SSH session drops during long provisioning operations and NotFound errors during resource cleanup.

Changes

1. IGW Detach NotFound (pkg/provider/aws/delete.go)

When an Internet Gateway is already deleted, the detach step now recognizes InvalidInternetGatewayID.NotFound alongside Gateway.NotAttached and skips retries instead of looping maxRetries times.

2. NLB NotFound Handling (pkg/provider/aws/nlb.go)

All NLB cleanup paths (deleteNLB, deleteListener, deleteTargetGroup, deleteNLBForCluster) now check for LoadBalancerNotFound, ListenerNotFound, and TargetGroupNotFound before retrying, treating already-deleted resources as success.

3. SSH Keepalive & Handshake Timeout (pkg/provisioner/provisioner.go)

  • Keepalive: SSH connections send keepalive@holodeck probes every 30s to prevent session drops during long operations (e.g., kubeadm init).
  • Handshake timeout: A 15s deadline on the underlying TCP connection prevents ssh.NewClientConn from blocking indefinitely against hosts that accept TCP but never complete the SSH handshake (ssh.ClientConfig.Timeout only applies to ssh.Dial, not the transport path).

4. Cleanup NotFound Warnings (pkg/cleanup/cleanup.go)

The periodic cleanup job no longer logs misleading "Failed to detach/delete internet gateway" warnings when an IGW is already gone.

5. Version Bump to v0.3.4

Tests Added

  • pkg/provider/aws/delete_igw_test.go — 3 tests (NotFound, NotAttached, real error retries)
  • pkg/provider/aws/nlb_delete_test.go — 6 tests (NLB/listener/target-group NotFound + real errors)
  • pkg/provisioner/ssh_config_test.go — handshake timeout test using black-hole TCP server
  • pkg/cleanup/cleanup_ginkgo_test.go — IGW NotFound completion test

Closes #771

The detach retry block in deleteInternetGateway only checked for
Gateway.NotAttached. When the IGW was already deleted, DetachInternetGateway
returned InvalidInternetGatewayID.NotFound, which was retried up to
maxRetries times. Now both conditions are treated as success via the
new isAlreadyDetachedError helper.

Ref: NVIDIA#771
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
deleteNLB, deleteListener, and deleteTargetGroup had no NotFound
error handling. If any resource was already deleted, errors would
propagate instead of being treated as success. Added helper functions
isNLBNotFoundError, isTargetGroupNotFoundError, isListenerNotFoundError
and applied them to all delete and describe paths.

Ref: NVIDIA#771
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Long-running remote commands (kubeadm init, ~10-20 min) were failing
with ExitMissingError because network middleboxes drop idle TCP
connections. Added:
- 15s handshake timeout on ssh.ClientConfig.Timeout (for ssh.Dial)
- 15s deadline on net.Conn before ssh.NewClientConn (for transport path)
- 30s keepalive interval via SSH global requests (startKeepalive)

The keepalive goroutine self-terminates when the client is closed.

Ref: NVIDIA#771
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
When the periodic cleanup job encounters an IGW that is already gone
(InvalidInternetGatewayID.NotFound) or already detached
(Gateway.NotAttached), the warning is now silently skipped instead of
logging a misleading failure message.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 1, 2026

Pull Request Test Coverage Report for Build 23855746299

Details

  • 43 of 63 (68.25%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.3%) to 47.756%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/provider/aws/nlb.go 24 27 88.89%
pkg/cleanup/cleanup.go 4 8 50.0%
pkg/provisioner/provisioner.go 5 18 27.78%
Totals Coverage Status
Change from base Build 23843738221: 1.3%
Covered Lines: 5267
Relevant Lines: 11029

💛 - Coveralls

If the NLB is deleted between cache population and the describe call in
deleteNLBForCluster, the LoadBalancerNotFound error is now treated as
success instead of propagating as a hard error.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Extends the NotFound suppression from deleteInternetGateways to all
cleanup delete functions:
- deleteSecurityGroups: InvalidGroup.NotFound
- deleteSubnets: InvalidSubnetID.NotFound
- deleteRouteTables: InvalidRouteTableID.NotFound

When a resource is already gone, the warning is now silently skipped
instead of logging a misleading failure message.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez disabled auto-merge April 1, 2026 17:08
@ArangoGutierrez ArangoGutierrez merged commit d2c0a36 into NVIDIA:main Apr 1, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

E2E failure on babab844

2 participants