
fix(ci): improve GPU test reliability and deploy timeout handling #539

Merged
mchmarny merged 2 commits into NVIDIA:main from yuanchen8911:fix/ci-helm-hook-timeout
Apr 13, 2026

Conversation

@yuanchen8911 (Contributor) commented Apr 10, 2026

Summary

Fix deploy.sh hook timeout handling, hook Job cleanup on retry, and multiple GPU CI reliability issues (kind load timeouts, evidence gating, workflow timeouts, selective validator builds, per-component Helm timeouts).

Motivation / Context

This PR addresses eight related GPU CI reliability issues:

  1. Helm hook timeout: The dynamo-platform chart has a pre-install hook Job (ssh-keygen) that must complete before install. When --no-wait cleared both --wait and --timeout, Helm fell back to its 5-minute default — insufficient on cold runners where the hook image must be pulled.

  2. Failed/stuck hook Jobs block retries: When a Helm hook Job (e.g., kai-scheduler crd-upgrader) times out, it stays in the namespace and blocks subsequent helm upgrade --install retries with "Job not ready" errors. This includes both failed Jobs and Jobs stuck in Pending (where Helm's --timeout expired before the pod started). Observed on the conformance test for PR #535 (fix(ci): auto-label new issues by area and assign owners).

  3. Async components lost hook timeout: ASYNC_COMPONENTS (kai-scheduler) cleared COMPONENT_WAIT_ARGS entirely, removing --timeout along with --wait. Hook Jobs on async components fell back to Helm's 5-minute default.

  4. Kind image load timeout: kind load docker-image had a 300s timeout per attempt. On slow runners, loading the ~250MB CUDA base image to a 2-node cluster exceeded this — both attempts timed out (exit code 124).

  5. Evidence collection after failure: Collect AI conformance evidence used if: always(), causing it to run (and fail) even when earlier steps like Build aicr failed.

  6. DRA rollout skipped under --no-wait: The DRA kubelet plugin restart+rollout-status is a correctness gate (kubelet must re-register the plugin socket), not a readiness convenience. Since CI uses --no-wait via gpu-operator-install, the rollout wait was being skipped on the exact path that H100 validation depends on.

  7. H100 x2 workflow timeout: Training and conformance jobs on linux-amd64-gpu-h100-latest-2 runners consistently exceed the original timeout due to slow Docker/registry I/O during Build aicr (17-36 min vs 1.5 min on H100 x1) and slow in-cluster image pulls during bundle install (kai-scheduler hook alone takes 10-20+ min). Build aicr was also compiling all 3 validator binaries when only conformance is needed. At 90 min, the pre-validation path still consumed the entire budget on slow runners; bumped to 120 min to provide sufficient headroom.

  8. kai-scheduler hook needs longer timeout: The kai-scheduler crd-upgrader pre-install hook pulls its container image from ghcr.io inside the kind cluster. On H100 x2 runners with degraded I/O, this frequently exceeds the global 10m Helm timeout, triggering unnecessary retry cycles (10m timeout → cleanup → retry). A 20m per-component timeout avoids the wasted first attempt while keeping other components at 10m. Validated: both H100 x2 runs passed kai-scheduler on first attempt with 20m.
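Issues 2 and 8 revolve around the retry loop in deploy.sh: stale hook Jobs must be cleaned up between attempts so the next `helm upgrade --install` can succeed. A minimal sketch of that shape (the function names `helm_retry` and `cleanup_helm_hooks` mirror the PR; the bodies here are simplified stand-ins, not the actual implementation — the real cleanup deletes non-succeeded hook Jobs via kubectl):

```shell
#!/bin/sh
MAX_RETRIES=3

# Placeholder for the real cleanup, which deletes any non-succeeded
# Helm hook Job so a stale Job cannot block the next attempt.
cleanup_helm_hooks() {
  echo "cleaning up stale hook Jobs before retry"
}

# Retry a command up to MAX_RETRIES times, cleaning up between attempts.
helm_retry() {
  attempt=1
  while [ "$attempt" -le "$MAX_RETRIES" ]; do
    if "$@"; then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    cleanup_helm_hooks
    attempt=$((attempt + 1))
  done
  echo "failed after ${MAX_RETRIES} retries" >&2
  return 1
}

# Demo: a stand-in install command that fails twice, then succeeds.
count_file=$(mktemp)
echo 0 > "$count_file"
flaky_install() {
  n=$(($(cat "$count_file") + 1))
  echo "$n" > "$count_file"
  [ "$n" -ge 3 ]
}
result=$(helm_retry flaky_install)
echo "$result"
rm -f "$count_file"
```

With this shape, the first failed attempt is no longer wasted on a namespace that still contains the dead hook Job from the previous try.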

Fixes: N/A
Related: #534

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)
  • Other: .github/actions/aicr-build/action.yml, .github/workflows/gpu-h100-{conformance,training,inference}-test.yaml

Implementation Notes

| Change | Before | After |
|---|---|---|
| `--no-wait` hook timeout | `WAIT_ARGS=""` (5-min default) | Per-component `--timeout ${COMPONENT_HELM_TIMEOUT}` |
| Async component hook timeout | `COMPONENT_WAIT_ARGS=""` (5-min default) | `--timeout ${COMPONENT_HELM_TIMEOUT}` (skip `--wait`, keep timeout) |
| Helm timeout architecture | Global `WAIT_ARGS` coupling timeout + wait behavior | Three independent concerns: `HELM_TIMEOUT` (global default), `COMPONENT_HELM_TIMEOUT` (per-component override), `NO_WAIT`/`ASYNC_COMPONENTS` (wait behavior) |
| kai-scheduler timeout | 10m (global default) | 20m (per-component override via template `{{ if eq .Name "kai-scheduler" }}`) |
| Helm hook retry | Retries fail due to stale hook Jobs | `helm_retry()` cleans up all non-succeeded hook Jobs (failed, pending, stuck) before each retry |
| Hook cleanup robustness | jsonpath with fragile field alignment | Per-Job JSON lookup; deletes any non-succeeded hook Job unconditionally since cleanup only runs after Helm failure |
| DRA rollout with `--no-wait` | Skipped (lost correctness gate) | Always waits; the DRA plugin restart is a correctness gate, not a readiness convenience |
| H100 x2 workflow timeout | 45-60 min | 120 min (training + conformance); pre-validation path consumes 77-90 min on slow runners |
| Validator builds | All 3 phases for every workflow | Selective via `validator_phases` input: training/conformance build only `conformance`, inference builds `deployment,conformance`, smoke builds none |
| kind load timeout (CUDA image) | 300s per attempt | 600s per attempt |
| Evidence collection | `if: always()` (runs after any failure) | `if: always() && steps.bundle-install.outcome == 'success'` |
| `--no-wait` docs | "skip waiting for each component" | "keeps `--timeout` for hooks" |
| Workflow step name | Install GPU operator (bundle) | Install runtime bundle (reflects full stack deploy) |
| Workflow step ID | `gpu-operator` | `bundle-install` |
| Test coverage | No assertions for new functions | Tests for `helm_retry`, `cleanup_helm_hooks`, `HELM_TIMEOUT`, `NO_WAIT`, kai-scheduler 20m override |
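The kai-scheduler override described above can be sketched as a timeout-resolution helper. This is illustrative only — in the generated deploy.sh the 20m value is baked in at bundle-generation time via the `{{ if eq .Name "kai-scheduler" }}` template, not resolved dynamically — but the effective behavior is:

```shell
#!/bin/sh
# Illustrative sketch: resolve a per-component Helm timeout, falling
# back to the global default. The function name is an assumption for
# this example; the real script renders the value from a template.
HELM_TIMEOUT="10m"

component_helm_timeout() {
  case "$1" in
    kai-scheduler) echo "20m" ;;          # slow ghcr.io image pull inside kind
    *)             echo "$HELM_TIMEOUT" ;;  # global default for everything else
  esac
}

kai=$(component_helm_timeout kai-scheduler)
other=$(component_helm_timeout gpu-operator)
echo "kai-scheduler: $kai, gpu-operator: $other"
```

Keeping a single override point like this avoids raising the global timeout for components that do not need it.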

The build_validators input is kept as a deprecated fallback for out-of-tree callers. When validator_phases is set, it takes precedence.
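A workflow consuming the action might set the new input as below. The input names `validator_phases` and `build_validators` come from this PR; the step name and surrounding structure are assumptions for this sketch:

```yaml
# Illustrative workflow step; everything other than the two inputs
# is an assumption for this example.
- name: Build aicr
  uses: ./.github/actions/aicr-build
  with:
    # Build only the conformance validator for this workflow.
    validator_phases: conformance
    # Deprecated fallback, still honored for out-of-tree callers;
    # ignored when validator_phases is set.
    build_validators: 'false'
```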

Testing

```shell
GOFLAGS="-mod=vendor" go test -race ./pkg/bundler/deployer/helm/...
# ok  github.com/NVIDIA/aicr/pkg/bundler/deployer/helm  1.893s
```

GPU CI results across multiple runs:

  • Smoke (T4): ✅ Passed (all runs)
  • Inference (H100 x1): ✅ Passed (all runs)
  • Training (H100 x2): kai-scheduler hook passed on first attempt with 20m timeout (no retry needed). Build aicr 18-27m (prev 24m, selective validators). Bundle install 40-45m on slow runners.
  • Conformance (H100 x2): kai-scheduler hook passed on first attempt with 20m timeout. Build aicr 17-18m (prev 36m — selective validators saved ~19m). Bundle install 40m on slow runners.

H100 x2 runs at 90 min timed out during validation/chainsaw — bumped to 120 min to provide sufficient headroom. The deploy.sh fixes, selective validators, and per-component timeout are all validated end-to-end.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Existing bundles will regenerate with the new per-component timeout, helm_retry, and DRA rollout behavior on next aicr bundle. No migration needed. Out-of-tree callers of aicr-build using build_validators: 'false' continue to work via the deprecated fallback.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 requested a review from a team as a code owner April 10, 2026 22:21
@yuanchen8911 force-pushed the fix/ci-helm-hook-timeout branch from cb5ea40 to 6e16840 April 10, 2026 22:24
@yuanchen8911 requested a review from a team as a code owner April 10, 2026 22:24
@yuanchen8911 force-pushed the fix/ci-helm-hook-timeout branch from 6e16840 to db94930 April 10, 2026 23:29
@yuanchen8911 changed the title from "fix(bundler): preserve --timeout for hooks with --no-wait" to "fix(ci): improve GPU test reliability and deploy timeout handling" Apr 10, 2026
@yuanchen8911 force-pushed the fix/ci-helm-hook-timeout branch 2 times, most recently from ec24318 to 9b0fcc5 April 10, 2026 23:43
@yuanchen8911 requested a review from xdu31 April 10, 2026 23:43
@yuanchen8911 force-pushed the fix/ci-helm-hook-timeout branch from 9b0fcc5 to 939b162 April 10, 2026 23:46
@xdu31 (Contributor) commented Apr 11, 2026

  1. --no-wait behavior change is user-facing (note, not a blocker)
    This changes what --no-wait means for anyone running deploy.sh from a bundle — previously it cleared all wait/timeout args, now it keeps --timeout 10m. This is the right fix (hooks without a timeout fall back to Helm's 5-min default which is insufficient), but worth calling out in the PR description that this is an intentional semantic change.

  2. README template is stale
    README.md.tmpl:53 still describes --no-wait as "skip waiting for each component to become ready". Since the behavior now preserves hook timeouts, the description should be updated. The inline comment on line 8 of deploy.sh.tmpl is updated but the README isn't.

xdu31 previously approved these changes Apr 11, 2026
@yuanchen8911 (Contributor, Author) replied:

> 1. --no-wait behavior change is user-facing (note, not a blocker) […]
> 2. README template is stale […]

Addressed both items:

--no-wait semantic change: Added to PR description. This is intentional — without --timeout, Helm hooks fall back to a 5-min default which is insufficient for cold runners pulling hook images (observed on dynamo-platform ssh-keygen hook).

README template + CLI reference updated: README.md.tmpl:53 and docs/user/cli-reference.md:1158 now describe --no-wait as "keeps --timeout for hooks" instead of the previous generic description.

@yuanchen8911 force-pushed the fix/ci-helm-hook-timeout branch 4 times, most recently from 772c7fb to efc7407 April 11, 2026 17:36

@yuanchen8911 force-pushed the fix/ci-helm-hook-timeout branch 4 times, most recently from 3e1119c to ef8be7b April 11, 2026 20:20
@github-actions bot added size/L and removed size/M labels Apr 11, 2026
@yuanchen8911 force-pushed the fix/ci-helm-hook-timeout branch from ef8be7b to eac87b6 April 11, 2026 20:22
- Preserve --timeout for Helm hooks with --no-wait in deploy.sh
- Clean up failed Helm hook Jobs before retry — uses helm.sh/hook
  annotation and status.failed to target only failed hook Jobs
- Preserve --timeout 10m for async components (e.g., kai-scheduler)
  so hook Jobs like crd-upgrader get adequate timeout on cold runners
- Increase kind load timeout from 300s to 600s for the CUDA-based
  smoke-test image; use 120s for tiny validator images
- Add build_validators input to aicr-build action; skip validator
  builds in smoke test to fit within 30m job timeout
- Gate evidence collection on bundle install success so it runs
  after validation/chainsaw failures but skips after infrastructure
  failures (build, cluster setup)
- Bump training and inference workflow timeouts from 45m to 60m
- Rename "Install GPU operator (bundle)" to "Install runtime bundle"
  to reflect that it deploys the full stack, not just GPU operator
- Update --no-wait description in README template and CLI reference
@mchmarny (Member) left a comment

Solid PR — the eight issues are well-diagnosed and the fixes are clean. One real suggestion (jq for annotation parsing), rest are nits/future-proofing. Tests cover the new pieces. LGTM with minor comments.

@yuanchen8911 (Contributor, Author) commented:

Addressed all review comments:

  1. Hook annotation check — replaced grep -o '"helm.sh/hook"' on raw JSON with kubectl get job -o jsonpath='{.metadata.annotations.helm\.sh/hook}'. Removed the unused job_json variable. Uses jsonpath (always available) instead of jq (not guaranteed).
  2. Retry message off-by-one — fixed in both retry() and helm_retry(): ${MAX_RETRIES} retries instead of ${attempt} attempts.
  3. Dead none branch in aicr-build — removed the redundant inner check since the step-level if already skips it.
  4. HelmTimeout field — replied inline; agree it's the right move if more overrides appear, keeping narrow for now.

@mchmarny ready for re-review when you get a chance.

@xdu31 (Contributor) commented Apr 13, 2026

/lgtm

@mchmarny mchmarny enabled auto-merge (squash) April 13, 2026 17:14
@mchmarny mchmarny merged commit d821306 into NVIDIA:main Apr 13, 2026
25 checks passed
