fix: verify API server against local IP before switching to NLB#721
Merged
ArangoGutierrez merged 2 commits intoNVIDIA:mainfrom Mar 13, 2026
Merged
Conversation
HA cluster kubeadm init succeeded but kubectl version failed because admin.conf pointed at the NLB before its health checks passed. Verify the API server using the local private IP first, then switch to the NLB endpoint. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull Request Test Coverage Report for Build 23046734486Details
💛 - Coveralls |
Address 4 issues from Distinguished Engineer review: 1. BLOCKING: NODE_PRIVATE_IP and INIT_ENDPOINT were scoped inside the `if [[ ! -f admin.conf ]]` block but used outside it for API verification and NLB switch — would crash on re-run. Moved both assignments before the init guard. 2. Escape dots in INIT_ENDPOINT for sed regex safety — IP addresses contain literal dots that are regex metacharacters. 3. Move NLB switch block to after Calico installation per design: NLB health checks require a working CNI to pass. Previous position was before Calico, violating the design contract. 4. Add test assertion verifying Calico installation happens before NLB endpoint switch. Signed-off-by: Eduardo Arango <earango@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
There was a problem hiding this comment.
Pull request overview
Improves HA kubeadm cluster bring-up reliability by ensuring the API server is reachable via the node’s private IP immediately after kubeadm init, and only switching kubeconfig/cluster endpoint to the NLB after Calico is installed (so NLB health checks have time to converge).
Changes:
- Detect node private IP early and use it for initial API server verification after
kubeadm init. - Defer updating kubeadm-config/admin.conf to the NLB endpoint until after Calico is installed and ready.
- Add a unit test asserting the ordering: local IP verification → Calico install → NLB switch.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/provisioner/templates/kubeadm_cluster.go | Reorders/adjusts kubeadm init flow to verify via local IP first and delay NLB switch until after Calico. |
| pkg/provisioner/templates/kubernetes_test.go | Adds a test to enforce the intended ordering in the generated kubeadm init script. |
You can also share your feedback on Copilot code review. Take the survey.
Comment on lines
+200
to
+204
| # Verify API server against local private IP first. For HA clusters, admin.conf | ||
| # still points at the local IP at this stage. For non-HA clusters this is a no-op | ||
| # since KUBECONFIG already targets the right endpoint. | ||
| holodeck_retry 10 "$COMPONENT" kubectl --kubeconfig "$KUBECONFIG" \ | ||
| --server="https://${NODE_PRIVATE_IP}:6443" version |
Comment on lines
+1201
to
+1205
| func TestKubeadmInit_HA_VerifiesLocalIPBeforeNLBSwitch(t *testing.T) { | ||
| // HA clusters use an NLB endpoint. After kubeadm init, the API server must | ||
| // be verified against the local private IP BEFORE switching admin.conf to | ||
| // the NLB endpoint. Otherwise kubectl version hits the NLB which hasn't | ||
| // passed health checks yet, exhausting all retry attempts. |
|
|
||
| // Calico must be installed BEFORE the NLB switch — NLB health checks | ||
| // require a working CNI to pass. | ||
| assert.Greater(t, calicoInstallPos, 0, |
Comment on lines
+1233
to
+1242
| assert.Contains(t, out, localVerifyMarker, | ||
| "HA init template must verify API server against local private IP") | ||
| assert.Contains(t, out, nlbSwitchMarker, | ||
| "HA init template must update cluster config to use NLB endpoint") | ||
|
|
||
| // Critical ordering: local verification MUST happen BEFORE the NLB switch | ||
| localVerifyPos := strings.Index(out, localVerifyMarker) | ||
| nlbSwitchPos := strings.Index(out, nlbSwitchMarker) | ||
| calicoInstallMarker := "Installing Calico" | ||
| calicoInstallPos := strings.Index(out, calicoInstallMarker) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
kubeadm init, verify API server against local private IP firstArchitecture Decision
Follows approved design: docs/plans/2026-03-12-production-grade-clusters-design.md
Test plan