fix: switch HA NLB to internal scheme to fix hairpin routing #760

Merged
ArangoGutierrez merged 1 commit into NVIDIA:main from ArangoGutierrez:fix/ha-nlb-internal-scheme on Mar 31, 2026

@ArangoGutierrez
Collaborator

Summary

  • Switch NLB from internet-facing to internal to fix hairpin routing that causes dial tcp ...:6443: i/o timeout in HA clusters
  • Remove NLB DNS propagation wait (internal NLBs resolve immediately via VPC DNS)

Root Cause

When a control-plane node connects to the NLB via its public DNS after kubeconfig switchover, the packet goes node → IGW → NLB → same node. AWS NLBs don't support this hairpin routing, so the connection times out.

Test plan

  • cluster && ha E2E test passes (post-merge)
  • Existing unit tests pass
  • go build ./... clean

Fixes #746

The internet-facing NLB resolves to a public IP. When control-plane
nodes connect to it after kubeconfig switchover, the hairpin routing
(node → IGW → NLB → same node) is not supported by AWS NLBs, causing
i/o timeouts on port 6443.

Switch to an internal NLB which gets a private VPC IP, routing traffic
directly within the VPC. Also remove the NLB DNS propagation wait since
internal NLBs resolve immediately via VPC DNS.
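One way to sanity-check the new behaviour is to confirm that the NLB DNS name resolves only to private addresses. The sketch below assumes the addresses have already been resolved (for example with `net.LookupHost`); the helper `allPrivate` and the sample IPs are hypothetical and not part of this change.

```go
package main

import (
	"fmt"
	"net"
)

// allPrivate (hypothetical helper) reports whether every entry in addrs
// parses as an IP address in a private range (RFC 1918 for IPv4,
// RFC 4193 for IPv6). Empty input or an unparsable entry yields false.
func allPrivate(addrs []string) bool {
	if len(addrs) == 0 {
		return false
	}
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil || !ip.IsPrivate() {
			return false
		}
	}
	return true
}

func main() {
	// Sample addresses for illustration only; a real check would pass the
	// result of net.LookupHost on the NLB DNS name.
	fmt.Println(allPrivate([]string{"10.0.1.25", "10.0.2.40"}))   // true
	fmt.Println(allPrivate([]string{"10.0.1.25", "54.12.34.56"})) // false
}
```

`net.IP.IsPrivate` requires Go 1.17 or later.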

Fixes NVIDIA#746

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez merged commit cd30218 into NVIDIA:main Mar 31, 2026
17 checks passed

Development

Successfully merging this pull request may close these issues.

E2E failure on 1412392e
