Improve helm prereqs setup, preflight, and health checks #1389

Open
benhuntley wants to merge 1 commit into NVIDIA:main from benhuntley:helm-prereqs-setup-and-healthcheck

@benhuntley benhuntley commented May 5, 2026

Add a more flexible setup flow for helm-prereqs deployments:

  • Add setup.sh flags for skip-core, skip-rest, custom core values, custom MetalLB config, site overlay application, and debug tracing.
  • Make registry pull credentials optional so public, preloaded, or externally managed imagePullSecrets can be used.
  • Create imagepullsecret only when REGISTRY_PULL_SECRET is set, derive the registry server from NCX_IMAGE_REGISTRY, and support custom REGISTRY_PULL_USERNAME.
  • Skip REST-specific env and repo checks when --skip-rest is used, and skip Core image tag validation when --skip-core is used.
  • Expand preflight validation for custom MetalLB manifests or kustomize dirs, referenced imagePullSecrets, cluster reachability, schedulable nodes, per-node sysctls, and per-node DNS.
  • Sync Vault AppRole credentials through Vault KV and External Secrets instead of patching the Kubernetes Secret from a hook sidecar.
  • Wait for PostgreSQL credentials and Vault AppRole credentials before deploying NCX Core.
  • Harden Vault unseal handling and setup failure cleanup prompts.
  • Add health-check.sh for post-install validation of prereqs, Vault, PostgreSQL, Core pods, PKI, ExternalSecrets, LoadBalancer VIPs, DNS, NTP, and in-cluster service connectivity.
  • Update helm-prereqs documentation for the new setup modes, optional pull secret contract, custom values/config flags, and health check.
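
The optional pull secret contract described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `derive_registry_server` and `maybe_create_pull_secret`, the namespace argument, and the default username `$oauthtoken` are assumptions made for the example. Only the variable names `REGISTRY_PULL_SECRET`, `NCX_IMAGE_REGISTRY`, `REGISTRY_PULL_USERNAME` and the secret name `imagepullsecret` come from the description.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Derive the registry host from an image registry path,
# e.g. "nvcr.io/org/team" -> "nvcr.io". (Hypothetical helper.)
derive_registry_server() {
  local registry="${1#*://}"        # drop an optional scheme such as https://
  printf '%s\n' "${registry%%/*}"   # keep everything before the first slash
}

# Create the pull secret only when REGISTRY_PULL_SECRET is set and non-empty.
# (Hypothetical helper; the namespace is passed in by the caller.)
maybe_create_pull_secret() {
  local namespace="$1"
  if [ -z "${REGISTRY_PULL_SECRET:-}" ]; then
    echo "REGISTRY_PULL_SECRET not set; skipping imagepullsecret creation"
    return 0
  fi

  # Assumption: fall back to the literal string $oauthtoken as the username.
  # The single quotes are intentional so that no expansion happens.
  local default_user='$oauthtoken'
  local server
  server="$(derive_registry_server "${NCX_IMAGE_REGISTRY:?NCX_IMAGE_REGISTRY must be set}")"

  kubectl create secret docker-registry imagepullsecret \
    --namespace "$namespace" \
    --docker-server="$server" \
    --docker-username="${REGISTRY_PULL_USERNAME:-$default_user}" \
    --docker-password="$REGISTRY_PULL_SECRET"
}
```

With `REGISTRY_PULL_SECRET` unset, `maybe_create_pull_secret <namespace>` only prints the skip message, which is the path that public, preloaded, or externally managed pull secrets would take.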

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@benhuntley benhuntley requested review from a team and shayan1995 as code owners May 5, 2026 03:43

copy-pr-bot Bot commented May 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

  Add a more flexible setup flow for helm-prereqs deployments:

  - Add setup.sh flags for skip-core, skip-rest, custom core values,
    custom MetalLB config, site overlay application, and debug tracing.
  - Make registry pull credentials optional so public, preloaded, or
    externally managed imagePullSecrets can be used.
  - Create imagepullsecret only when REGISTRY_PULL_SECRET is set, derive
    the registry server from NCX_IMAGE_REGISTRY, and support custom
    REGISTRY_PULL_USERNAME.
  - Skip REST-specific env and repo checks when --skip-rest is used, and
    skip Core image tag validation when --skip-core is used.
  - Expand preflight validation for custom MetalLB manifests or kustomize
    dirs, referenced imagePullSecrets, cluster reachability, schedulable
    nodes, per-node sysctls, and per-node DNS.
  - Sync Vault AppRole credentials through Vault KV and External Secrets
    instead of patching the Kubernetes Secret from a hook sidecar.
  - Wait for PostgreSQL credentials and Vault AppRole credentials before
    deploying NCX Core.
  - Harden Vault unseal handling and setup failure cleanup prompts.
  - Add health-check.sh for post-install validation of prereqs, Vault,
    PostgreSQL, Core pods, PKI, ExternalSecrets, LoadBalancer VIPs, DNS,
    NTP, and in-cluster service connectivity.
  - Update helm-prereqs documentation for the new setup modes, optional
    pull secret contract, custom values/config flags, and health check.

Signed-off-by: Ben Huntley <bhuntley@nvidia.com>
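
The way `--skip-core` and `--skip-rest` gate the checks described in the commit message might look something like the sketch below. Only the two flag names come from the commit message. The variables `SKIP_CORE` and `SKIP_REST`, the functions `parse_flags` and `planned_checks`, and the check labels are hypothetical names chosen for illustration, and the real setup.sh accepts more flags than shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

SKIP_CORE=0
SKIP_REST=0

# Parse the two skip flags; anything else is rejected with exit status 2.
parse_flags() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --skip-core) SKIP_CORE=1 ;;
      --skip-rest) SKIP_REST=1 ;;
      *) echo "unknown flag: $1" >&2; return 2 ;;
    esac
    shift
  done
}

# Print the checks that would still run given the current flags.
# The check labels are placeholders, not names from the PR.
planned_checks() {
  echo "cluster-reachability"
  if [ "$SKIP_CORE" -eq 0 ]; then
    echo "core-image-tag"
  fi
  if [ "$SKIP_REST" -eq 0 ]; then
    echo "rest-env"
    echo "rest-repo"
  fi
}
```

For example, calling `parse_flags --skip-rest` and then `planned_checks` would list the cluster and Core image tag checks but omit the REST-specific ones.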
@benhuntley benhuntley force-pushed the helm-prereqs-setup-and-healthcheck branch from a73296c to 7f32e1c Compare May 5, 2026 20:01
  {{- if .Values.externalSecrets.enabled }}
  ## =============================================================================
- ## forge-roots — sync site-root CA cert to all carbide-managed namespaces.
+ ## forge-roots — sync site-root CA cert to all NCX-managed namespaces.
Collaborator

NCX Infra Controller became NVIDIA Infra Controller, so maybe just "infra controller" makes sense here? There are a few NCX references in this PR that should be updated where possible (some are breaking changes; those can be done more methodically).
