Skip to content

fix: verify API server against local IP before switching to NLB#721

Merged
ArangoGutierrez merged 2 commits intoNVIDIA:mainfrom
ArangoGutierrez:feat/prod-clusters-ha-fix
Mar 13, 2026
Merged

fix: verify API server against local IP before switching to NLB#721
ArangoGutierrez merged 2 commits intoNVIDIA:mainfrom
ArangoGutierrez:feat/prod-clusters-ha-fix

Conversation

@ArangoGutierrez
Copy link
Copy Markdown
Collaborator

Summary

  • Fix HA kubeadm init reliability (release blocker)
  • After kubeadm init, verify API server against local private IP first
  • Then switch kubeconfig to NLB endpoint
  • Ensures Calico installation proceeds immediately while NLB health checks converge

Architecture Decision

Follows approved design: docs/plans/2026-03-12-production-grade-clusters-design.md

Test plan

  • New unit test: verifies local IP check happens before NLB switch
  • All existing template tests pass

HA cluster kubeadm init succeeded but kubectl version failed because
admin.conf pointed at the NLB before its health checks passed. Verify
the API server using the local private IP first, then switch to the
NLB endpoint.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@coveralls
Copy link
Copy Markdown

Pull Request Test Coverage Report for Build 23046734486

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 49.567%

Totals Coverage Status
Change from base Build 23044432389: 0.0%
Covered Lines: 2916
Relevant Lines: 5883

💛 - Coveralls

Address 4 issues from Distinguished Engineer review:

1. BLOCKING: NODE_PRIVATE_IP and INIT_ENDPOINT were scoped inside
   the `if [[ ! -f admin.conf ]]` block but used outside it for
   API verification and NLB switch — would crash on re-run.
   Moved both assignments before the init guard.

2. Escape dots in INIT_ENDPOINT for sed regex safety — IP addresses
   contain literal dots that are regex metacharacters.

3. Move NLB switch block to after Calico installation per design:
   NLB health checks require a working CNI to pass. Previous
   position was before Calico, violating the design contract.

4. Add test assertion verifying Calico installation happens before
   NLB endpoint switch.

Signed-off-by: Eduardo Arango <earango@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves HA kubeadm cluster bring-up reliability by ensuring the API server is reachable via the node’s private IP immediately after kubeadm init, and only switching kubeconfig/cluster endpoint to the NLB after Calico is installed (so NLB health checks have time to converge).

Changes:

  • Detect node private IP early and use it for initial API server verification after kubeadm init.
  • Defer updating kubeadm-config/admin.conf to the NLB endpoint until after Calico is installed and ready.
  • Add a unit test asserting the ordering: local IP verification → Calico install → NLB switch.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
pkg/provisioner/templates/kubeadm_cluster.go Reorders/adjusts kubeadm init flow to verify via local IP first and delay NLB switch until after Calico.
pkg/provisioner/templates/kubernetes_test.go Adds a test to enforce the intended ordering in the generated kubeadm init script.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +200 to +204
# Verify API server against local private IP first. For HA clusters, admin.conf
# still points at the local IP at this stage. For non-HA clusters this is a no-op
# since KUBECONFIG already targets the right endpoint.
holodeck_retry 10 "$COMPONENT" kubectl --kubeconfig "$KUBECONFIG" \
--server="https://${NODE_PRIVATE_IP}:6443" version
Comment on lines +1201 to +1205
func TestKubeadmInit_HA_VerifiesLocalIPBeforeNLBSwitch(t *testing.T) {
// HA clusters use an NLB endpoint. After kubeadm init, the API server must
// be verified against the local private IP BEFORE switching admin.conf to
// the NLB endpoint. Otherwise kubectl version hits the NLB which hasn't
// passed health checks yet, exhausting all retry attempts.

// Calico must be installed BEFORE the NLB switch — NLB health checks
// require a working CNI to pass.
assert.Greater(t, calicoInstallPos, 0,
Comment on lines +1233 to +1242
assert.Contains(t, out, localVerifyMarker,
"HA init template must verify API server against local private IP")
assert.Contains(t, out, nlbSwitchMarker,
"HA init template must update cluster config to use NLB endpoint")

// Critical ordering: local verification MUST happen BEFORE the NLB switch
localVerifyPos := strings.Index(out, localVerifyMarker)
nlbSwitchPos := strings.Index(out, nlbSwitchMarker)
calicoInstallMarker := "Installing Calico"
calicoInstallPos := strings.Index(out, calicoInstallMarker)
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review March 13, 2026 12:52
@ArangoGutierrez ArangoGutierrez merged commit 6651aff into NVIDIA:main Mar 13, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants