fix: HA NLB hairpin routing and cleanup (#746) #762

Merged
ArangoGutierrez merged 3 commits into NVIDIA:main from ArangoGutierrez:fix/ha-nlb-internal-scheme on Mar 31, 2026

@ArangoGutierrez commented Mar 31, 2026

Summary

Three fixes for HA cluster NLB issues:

  1. Switch NLB to internal scheme — eliminates public IP routing through IGW
  2. CP nodes use local API server — stop patching admin.conf to NLB endpoint on control-plane nodes, avoiding AWS NLB hairpin/loopback (where a registered target connects through the NLB back to itself)
  3. Add NLB cleanup to periodic cleaner — delete NLBs before VPC cleanup to prevent DependencyViolation errors that cause VpcLimitExceeded

Root Cause

AWS NLBs drop hairpin traffic — when a registered target connects through the NLB and gets routed back to itself, the connection times out. Since all CP nodes are NLB targets, using the NLB endpoint in admin.conf causes dial tcp ...:6443: i/o timeout.
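
To make the failure condition concrete, here is a minimal Go sketch of the check implied above: a node is at risk of a hairpin timeout whenever its own address is among the NLB's registered targets. The function name `hairpinRisk` and the IP addresses are hypothetical and are not taken from the holodeck code base.

```go
package main

import (
	"fmt"
	"net/netip"
)

// hairpinRisk reports whether a client dialing through a load balancer
// could be routed back to itself, i.e. whether the client's own address
// is one of the registered targets. Hypothetical helper for illustration.
func hairpinRisk(client string, targets []string) (bool, error) {
	c, err := netip.ParseAddr(client)
	if err != nil {
		return false, fmt.Errorf("client address: %w", err)
	}
	for _, t := range targets {
		a, err := netip.ParseAddr(t)
		if err != nil {
			return false, fmt.Errorf("target address %q: %w", t, err)
		}
		if a == c {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Example addresses only: three control-plane nodes registered as
	// NLB targets, plus one worker that is not a target.
	cpTargets := []string{"10.0.1.10", "10.0.1.11", "10.0.1.12"}

	for _, node := range []string{"10.0.1.10", "10.0.2.20"} {
		risk, err := hairpinRisk(node, cpTargets)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s hairpin risk: %v\n", node, risk)
	}
}
```

Every control-plane node returns true here, which is why the fix points those nodes at their local API server instead of the NLB.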

The periodic cleanup (pkg/cleanup) also didn't handle NLB deletion, so stale NLBs blocked VPC cleanup, exhausting VPC quota and breaking CI.

Changes

  • pkg/provider/aws/nlb.go — NLB scheme changed to internal
  • pkg/provisioner/templates/kubeadm_cluster.go — removed admin.conf NLB patch on init node; added localhost:6443 patch on CP join nodes; removed DNS propagation wait
  • pkg/cleanup/cleanup.go — added ELBv2 client, NLB discovery and deletion before VPC cleanup

Test plan

  • go build ./... clean
  • go test ./pkg/provisioner/... pass
  • go test ./pkg/cleanup/... — 82/82 pass
  • cluster && ha E2E test passes (post-merge)
  • Periodic cleanup successfully deletes NLB-containing VPCs

Fixes #746


The internet-facing NLB resolves to a public IP. When control-plane
nodes connect to it after kubeconfig switchover, the hairpin routing
(node → IGW → NLB → same node) is not supported by AWS NLBs, causing
i/o timeouts on port 6443.

Switch to an internal NLB which gets a private VPC IP, routing traffic
directly within the VPC. Also remove the NLB DNS propagation wait since
internal NLBs resolve immediately via VPC DNS.
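
One way to sanity-check the scheme change is to confirm that the addresses the NLB DNS name resolves to are all private. The sketch below assumes the addresses have already been resolved (for example from inside the VPC); `allPrivate` and the sample addresses are hypothetical and not part of this PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// allPrivate reports whether every address is in a private range
// (RFC 1918 for IPv4, RFC 4193 for IPv6). Hypothetical helper.
func allPrivate(addrs []string) (bool, error) {
	if len(addrs) == 0 {
		return false, errors.New("no addresses to check")
	}
	for _, s := range addrs {
		a, err := netip.ParseAddr(s)
		if err != nil {
			return false, fmt.Errorf("address %q: %w", s, err)
		}
		if !a.IsPrivate() {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Example values only: what an internal NLB might resolve to,
	// versus a set that includes a public address.
	internal := []string{"10.0.1.57", "10.0.2.91"}
	mixed := []string{"10.0.1.57", "52.10.20.30"}

	for _, set := range [][]string{internal, mixed} {
		ok, err := allPrivate(set)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%v all private: %v\n", set, ok)
	}
}
```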

Fixes NVIDIA#746

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

AWS NLBs drop traffic when a registered target connects through the
NLB and gets routed back to itself (hairpin/loopback). This happens
on every control-plane node since they are all NLB targets.

Stop patching admin.conf to use the NLB endpoint on the init node.
On joining CP nodes, patch admin.conf to use localhost:6443 instead
of the NLB. The kubeadm-config ConfigMap still points to the NLB
so joining nodes and workers discover the correct endpoint.
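
The admin.conf patch on joining control-plane nodes amounts to rewriting the cluster `server:` entry. The Go sketch below shows one way to do that rewrite on the file contents; it is not the provisioner's actual template. The `https://` scheme, the helper name `useLocalAPIServer`, and the sample NLB hostname are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// serverLine matches a kubeconfig "server:" entry and captures the
// indentation and key so they are preserved in the replacement.
var serverLine = regexp.MustCompile(`(?m)^([ \t]*server:[ \t]*)\S+[ \t]*$`)

// useLocalAPIServer points every cluster entry in a kubeconfig at the
// node-local API server. Hypothetical helper; https:// is assumed.
func useLocalAPIServer(kubeconfig string) string {
	return serverLine.ReplaceAllString(kubeconfig, "${1}https://localhost:6443")
}

func main() {
	// The NLB hostname below is a made-up example.
	adminConf := `apiVersion: v1
clusters:
- cluster:
    server: https://example-nlb.elb.us-west-2.amazonaws.com:6443
  name: kubernetes
`
	fmt.Print(useLocalAPIServer(adminConf))
}
```

Only the node's own admin.conf changes. As the commit message notes, the kubeadm-config ConfigMap keeps the NLB endpoint so that joining nodes and workers still discover it.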

Fixes NVIDIA#746

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

@coveralls

Pull Request Test Coverage Report for Build 23799268336

Details

  • 0 of 4 (0.0%) changed or added relevant lines in 1 file are covered.
  • 42 unchanged lines in 2 files lost coverage.
  • Overall coverage remained the same at 45.798%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/provider/aws/nlb.go | 0 | 4 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/provisioner/templates/kubeadm_cluster.go | 10 | 78.72% |
| pkg/provider/aws/nlb.go | 32 | 0.0% |

| Totals | |
|---|---|
| Change from base Build 23791782575: | 0.0% |
| Covered Lines: | 4975 |
| Relevant Lines: | 10863 |


The periodic cleanup utility (pkg/cleanup) only handles EC2 resources
(instances, security groups, subnets, IGW, VPC). When HA clusters
create NLBs, the NLB ENIs in subnets block subnet and VPC deletion
with DependencyViolation errors.

Add ELBv2 client to the Cleaner and delete load balancers (with their
listeners and target groups) as the first step in VPC cleanup, before
instance termination. Include a 30s wait after NLB deletion for ENI
detachment.
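
The ordering described above can be sketched in Go as follows. The `vpcCleaner` interface and the `recorder` fake are hypothetical and do not mirror the real pkg/cleanup API. The order of the EC2 steps after instance termination, and skipping the wait when no load balancer was deleted, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// vpcCleaner lists the operations the sketch needs (hypothetical).
type vpcCleaner interface {
	DeleteLoadBalancers(vpcID string) (deleted int, err error)
	WaitForENIDetach() // a real implementation would wait about 30s here
	TerminateInstances(vpcID string) error
	DeleteSecurityGroups(vpcID string) error
	DeleteSubnets(vpcID string) error
	DeleteInternetGateway(vpcID string) error
	DeleteVPC(vpcID string) error
}

// cleanupVPC removes load balancers first so their ENIs no longer block
// subnet and VPC deletion, then tears down the EC2 resources.
func cleanupVPC(c vpcCleaner, vpcID string) error {
	n, err := c.DeleteLoadBalancers(vpcID)
	if err != nil {
		return fmt.Errorf("delete load balancers: %w", err)
	}
	if n > 0 {
		c.WaitForENIDetach()
	}
	steps := []struct {
		name string
		run  func(string) error
	}{
		{"terminate instances", c.TerminateInstances},
		{"delete security groups", c.DeleteSecurityGroups},
		{"delete subnets", c.DeleteSubnets},
		{"delete internet gateway", c.DeleteInternetGateway},
		{"delete vpc", c.DeleteVPC},
	}
	for _, s := range steps {
		if err := s.run(vpcID); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// recorder is a fake cleaner that records the order of calls.
type recorder struct {
	nlbs  int
	calls []string
}

func (r *recorder) log(name string) error {
	r.calls = append(r.calls, name)
	return nil
}

func (r *recorder) DeleteLoadBalancers(string) (int, error) {
	return r.nlbs, r.log("DeleteLoadBalancers")
}
func (r *recorder) WaitForENIDetach()                  { _ = r.log("WaitForENIDetach") }
func (r *recorder) TerminateInstances(string) error    { return r.log("TerminateInstances") }
func (r *recorder) DeleteSecurityGroups(string) error  { return r.log("DeleteSecurityGroups") }
func (r *recorder) DeleteSubnets(string) error         { return r.log("DeleteSubnets") }
func (r *recorder) DeleteInternetGateway(string) error { return r.log("DeleteInternetGateway") }
func (r *recorder) DeleteVPC(string) error             { return r.log("DeleteVPC") }

func main() {
	r := &recorder{nlbs: 1}
	if err := cleanupVPC(r, "vpc-example"); err != nil {
		fmt.Println("cleanup failed:", err)
		return
	}
	fmt.Println(strings.Join(r.calls, " -> "))
}
```

Returning on the first failed step keeps a DependencyViolation visible to the caller instead of letting a later step mask it.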

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez changed the title from "fix: use local API server on CP nodes to avoid NLB hairpin (#746)" to "fix: HA NLB hairpin routing and cleanup (#746)" on Mar 31, 2026
@ArangoGutierrez ArangoGutierrez merged commit 41454b9 into NVIDIA:main Mar 31, 2026
17 checks passed
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Mar 31, 2026
Patch release with fixes for HA NLB hairpin routing (NVIDIA#746) and
VPC cleanup improvements (NVIDIA#762).

Changes since v0.3.0:
- fix: CP nodes use localhost:6443 to avoid NLB hairpin timeouts
- fix: switch HA NLB to internal scheme
- fix: add NLB cleanup to periodic VPC cleaner
- ci: update periodic cleanup to v0.3.0 with manual trigger

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit that referenced this pull request Mar 31, 2026
Patch release with fixes for HA NLB hairpin routing (#746) and
VPC cleanup improvements (#762).

Changes since v0.3.0:
- fix: CP nodes use localhost:6443 to avoid NLB hairpin timeouts
- fix: switch HA NLB to internal scheme
- fix: add NLB cleanup to periodic VPC cleaner
- ci: update periodic cleanup to v0.3.0 with manual trigger

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Development

Successfully merging this pull request may close these issues.

E2E failure on 1412392e
