fix: HA NLB hairpin routing and cleanup (#746) #762
Merged
ArangoGutierrez merged 3 commits into NVIDIA:main on Mar 31, 2026
The internet-facing NLB resolves to a public IP. When control-plane nodes connect to it after kubeconfig switchover, the hairpin routing (node → IGW → NLB → same node) is not supported by AWS NLBs, causing i/o timeouts on port 6443.

Switch to an internal NLB which gets a private VPC IP, routing traffic directly within the VPC. Also remove the NLB DNS propagation wait since internal NLBs resolve immediately via VPC DNS.

Fixes NVIDIA#746

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
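To illustrate the distinction this commit relies on (a public address for the internet-facing NLB versus a private VPC address for the internal one), here is a minimal sketch that classifies already-resolved addresses. The helper name `allPrivate` and the sample addresses are assumptions for illustration and are not taken from the holodeck code base.

```go
package main

import (
	"fmt"
	"net"
)

// allPrivate reports whether every address in addrs parses as an IP and
// falls in a private range (RFC 1918 for IPv4, RFC 4193 for IPv6).
// An empty list is treated as "not private" so callers fail closed.
// Hypothetical helper, not part of holodeck.
func allPrivate(addrs []string) bool {
	if len(addrs) == 0 {
		return false
	}
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil || !ip.IsPrivate() {
			return false
		}
	}
	return true
}

func main() {
	// Example addresses only: 10.0.0.0/8 is private, 203.0.113.0/24 is a
	// documentation range that is not private.
	fmt.Println(allPrivate([]string{"10.0.1.25", "10.0.2.31"}))    // true
	fmt.Println(allPrivate([]string{"10.0.1.25", "203.0.113.10"})) // false
}
```

A check like this could serve as a sanity test after provisioning, under the assumption that the NLB DNS name has been resolved elsewhere (for example with `net.LookupHost`).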
AWS NLBs drop traffic when a registered target connects through the NLB and gets routed back to itself (hairpin/loopback). This happens on every control-plane node since they are all NLB targets.

Stop patching admin.conf to use the NLB endpoint on the init node. On joining CP nodes, patch admin.conf to use localhost:6443 instead of the NLB. The kubeadm-config ConfigMap still points to the NLB so joining nodes and workers discover the correct endpoint.

Fixes NVIDIA#746

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
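The admin.conf patch described in this commit amounts to rewriting the `server:` entry of a kubeconfig. The sketch below shows one way to express that rewrite in Go. It is an illustration only: the function name `useLocalAPIServer` and the sample hostname are assumptions, and the PR itself changes a provisioner template rather than necessarily using code like this.

```go
package main

import (
	"fmt"
	"regexp"
)

// serverLine matches a kubeconfig "server: https://..." entry on its own
// line, capturing the indentation and key so they are preserved.
var serverLine = regexp.MustCompile(`(?m)^([ \t]*server:[ \t]*)https://\S+$`)

// useLocalAPIServer points every cluster entry at the local API server
// instead of the load balancer. Hypothetical helper, not part of holodeck.
func useLocalAPIServer(kubeconfig string) string {
	return serverLine.ReplaceAllString(kubeconfig, "${1}https://localhost:6443")
}

func main() {
	// The NLB hostname below is made up for the example.
	sample := "clusters:\n" +
		"- cluster:\n" +
		"    server: https://nlb.example.internal:6443\n" +
		"  name: kubernetes\n"
	fmt.Print(useLocalAPIServer(sample))
}
```

Only the local admin.conf is rewritten in this sketch; as the commit notes, the kubeadm-config ConfigMap keeps the NLB endpoint so that joining nodes and workers still discover it.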
The periodic cleanup utility (pkg/cleanup) only handles EC2 resources (instances, security groups, subnets, IGW, VPC). When HA clusters create NLBs, the NLB ENIs in subnets block subnet and VPC deletion with DependencyViolation errors.

Add ELBv2 client to the Cleaner and delete load balancers (with their listeners and target groups) as the first step in VPC cleanup, before instance termination. Include a 30s wait after NLB deletion for ENI detachment.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Mar 31, 2026

Patch release with fixes for HA NLB hairpin routing (NVIDIA#746) and VPC cleanup improvements (NVIDIA#762).

Changes since v0.3.0:
- fix: CP nodes use localhost:6443 to avoid NLB hairpin timeouts
- fix: switch HA NLB to internal scheme
- fix: add NLB cleanup to periodic VPC cleaner
- ci: update periodic cleanup to v0.3.0 with manual trigger

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Summary

Three fixes for HA cluster NLB issues:
- Stop patching `admin.conf` to the NLB endpoint on control-plane nodes, avoiding AWS NLB hairpin/loopback (where a registered target connects through the NLB back to itself)
- Switch the HA NLB to the internal scheme
- Delete NLBs during periodic cleanup, preventing the `DependencyViolation` errors that cause `VpcLimitExceeded`

Root Cause

AWS NLBs drop hairpin traffic: when a registered target connects through the NLB and gets routed back to itself, the connection times out. Since all CP nodes are NLB targets, using the NLB endpoint in `admin.conf` causes `dial tcp ...:6443: i/o timeout`.

The periodic cleanup (`pkg/cleanup`) also didn't handle NLB deletion, so stale NLBs blocked VPC cleanup, exhausting VPC quota and breaking CI.

Changes

- `pkg/provider/aws/nlb.go`: NLB scheme changed to internal
- `pkg/provisioner/templates/kubeadm_cluster.go`: removed `admin.conf` NLB patch on init node; added `localhost:6443` patch on CP join nodes; removed DNS propagation wait
- `pkg/cleanup/cleanup.go`: added ELBv2 client, NLB discovery and deletion before VPC cleanup

Test plan

- `go build ./...` clean
- `go test ./pkg/provisioner/...` pass
- `go test ./pkg/cleanup/...`: 82/82 pass
- `cluster && ha` E2E test passes (post-merge)

Fixes #746