fix(ci): improve GPU test reliability and deploy timeout handling #539

mchmarny merged 2 commits into NVIDIA:main from
Conversation
Addressed both items:

- `--no-wait` semantic change: Added to PR description. This is intentional — without `--timeout`, Helm hooks fall back to a 5-min default, which is insufficient for cold runners pulling hook images (observed on the dynamo-platform ssh-keygen hook).
- README template + CLI reference updated: `README.md.tmpl:53` and `docs/user/cli-reference.md:1158` now describe `--no-wait` as "keeps `--timeout` for hooks" instead of the previous generic description.
- Preserve `--timeout` for Helm hooks with `--no-wait` in deploy.sh
- Clean up failed Helm hook Jobs before retry — uses the `helm.sh/hook` annotation and `status.failed` to target only failed hook Jobs
- Preserve `--timeout 10m` for async components (e.g., kai-scheduler) so hook Jobs like `crd-upgrader` get an adequate timeout on cold runners
- Increase `kind load` timeout from 300s to 600s for the CUDA-based smoke-test image; use 120s for tiny validator images
- Add `build_validators` input to the aicr-build action; skip validator builds in the smoke test to fit within the 30m job timeout
- Gate evidence collection on bundle install success so it runs after validation/chainsaw failures but skips after infrastructure failures (build, cluster setup)
- Bump training and inference workflow timeouts from 45m to 60m
- Rename "Install GPU operator (bundle)" to "Install runtime bundle" to reflect that it deploys the full stack, not just the GPU operator
- Update the `--no-wait` description in the README template and CLI reference
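The failed-hook cleanup in the second bullet can be sketched as follows — a minimal illustration only, not the actual deploy.sh. `kubectl` is mocked here as a function emitting one row per Job (`name`, `status.failed` count, `helm.sh/hook` annotation); the real script would query the cluster and delete the matching Jobs:

```shell
#!/bin/sh
# Sketch of pre-retry hook-Job cleanup (illustration, not the actual deploy.sh).
# Mocked kubectl: each row is "<job-name> <status.failed> <helm.sh/hook annotation>",
# with "-" meaning the Job carries no hook annotation (i.e., not a Helm hook Job).
kubectl_mock() {
  cat <<'EOF'
crd-upgrader 1 pre-install
ssh-keygen 0 pre-install
worker-job 1 -
EOF
}

# Print only Jobs that are Helm hook Jobs AND have failed pods; the real
# script would run `kubectl delete job` on each name instead of echoing it.
cleanup_helm_hooks() {
  kubectl_mock | while read -r name failed hook; do
    if [ "$hook" != "-" ] && [ "$failed" -gt 0 ]; then
      echo "$name"
    fi
  done
}

cleanup_helm_hooks   # prints: crd-upgrader
```

Only `crd-upgrader` qualifies here: `ssh-keygen` has no failed pods, and `worker-job` is not a hook Job, so ordinary workloads are never touched.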
mchmarny
left a comment
Solid PR — the eight issues are well-diagnosed and the fixes are clean. One real suggestion (jq for annotation parsing), rest are nits/future-proofing. Tests cover the new pieces. LGTM with minor comments.
Addressed all review comments:
@mchmarny ready for re-review when you get a chance.
/lgtm |
Summary
Fix deploy.sh hook timeout handling, hook Job cleanup on retry, and multiple GPU CI reliability issues (kind load timeouts, evidence gating, workflow timeouts, selective validator builds, per-component Helm timeouts).
Motivation / Context
Eight related GPU CI reliability issues:

1. Helm hook timeout: The `dynamo-platform` chart has a pre-install hook Job (ssh-keygen) that must complete before install. When `--no-wait` cleared both `--wait` and `--timeout`, Helm fell back to its 5-minute default — insufficient on cold runners where the hook image must be pulled.
2. Failed/stuck hook Jobs block retries: When a Helm hook Job (e.g., kai-scheduler `crd-upgrader`) times out, it stays in the namespace and blocks subsequent `helm upgrade --install` retries with "Job not ready" errors. This includes both failed Jobs and Jobs stuck in Pending (where Helm's `--timeout` expired before the pod started). Observed on the conformance test for PR #535 (fix(ci): auto-label new issues by area and assign owners).
3. Async components lost hook timeout: `ASYNC_COMPONENTS` (kai-scheduler) cleared `COMPONENT_WAIT_ARGS` entirely, removing `--timeout` along with `--wait`. Hook Jobs on async components fell back to Helm's 5-minute default.
4. Kind image load timeout: `kind load docker-image` had a 300s timeout per attempt. On slow runners, loading the ~250MB CUDA base image to a 2-node cluster exceeded this — both attempts timed out (exit code 124).
5. Evidence collection after failure: Collect AI conformance evidence used `if: always()`, causing it to run (and fail) even when earlier steps like Build aicr failed.
6. DRA rollout skipped under `--no-wait`: The DRA kubelet plugin restart + rollout-status is a correctness gate (kubelet must re-register the plugin socket), not a readiness convenience. Since CI uses `--no-wait` via `gpu-operator-install`, the rollout wait was being skipped on the exact path that H100 validation depends on.
7. H100 x2 workflow timeout: Training and conformance jobs on `linux-amd64-gpu-h100-latest-2` runners consistently exceed the original timeout due to slow Docker/registry I/O during Build aicr (17-36 min vs 1.5 min on H100 x1) and slow in-cluster image pulls during bundle install (the kai-scheduler hook alone takes 10-20+ min). Build aicr was also compiling all 3 validator binaries when only conformance is needed. At 90 min, the pre-validation path still consumed the entire budget on slow runners; bumped to 120 min to provide sufficient headroom.
8. kai-scheduler hook needs a longer timeout: The kai-scheduler `crd-upgrader` pre-install hook pulls its container image from ghcr.io inside the kind cluster. On H100 x2 runners with degraded I/O, this frequently exceeds the global 10m Helm timeout, triggering unnecessary retry cycles (10m timeout → cleanup → retry). A 20m per-component timeout avoids the wasted first attempt while keeping other components at 10m. Validated: both H100 x2 runs passed kai-scheduler on first attempt with 20m.

Fixes: N/A
Related: #534
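The timeout-preservation fix behind items 1, 3, and 8 amounts to decoupling `--timeout` from `--wait`. A minimal sketch, assuming the fix works roughly as described (the function name and variable defaults here are illustrative, not the actual deploy.sh):

```shell
#!/bin/sh
# Sketch: keep --timeout even when waiting is disabled (illustrative names).
HELM_TIMEOUT="${HELM_TIMEOUT:-10m}"   # global default
NO_WAIT="${NO_WAIT:-false}"

helm_wait_args() {
  timeout="$HELM_TIMEOUT"
  # Per-component override for the slow kai-scheduler crd-upgrader hook.
  if [ "$1" = "kai-scheduler" ]; then timeout="20m"; fi
  if [ "$NO_WAIT" = "true" ]; then
    # Old behavior cleared both flags, so hook Jobs fell back to Helm's
    # 5-minute default; new behavior skips --wait but keeps --timeout.
    echo "--timeout ${timeout}"
  else
    echo "--wait --timeout ${timeout}"
  fi
}

NO_WAIT=true
helm_wait_args kai-scheduler   # prints: --timeout 20m
```

The point is that hook Jobs honor `--timeout` regardless of `--wait`, so dropping only `--wait` keeps the hook budget intact on cold runners.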
Type of Change
Component(s) Affected
- `pkg/bundler`, `pkg/component/*`
- `.github/actions/aicr-build/action.yml`, `.github/workflows/gpu-h100-{conformance,training,inference}-test.yaml`

Implementation Notes
- `--no-wait` hook timeout: `WAIT_ARGS=""` (5-min default) → `--timeout ${COMPONENT_HELM_TIMEOUT}`; likewise `COMPONENT_WAIT_ARGS=""` (5-min default) → `--timeout ${COMPONENT_HELM_TIMEOUT}` (skip `--wait`, keep timeout)
- The single `WAIT_ARGS` coupling timeout + wait behavior is split into `HELM_TIMEOUT` (global default), `COMPONENT_HELM_TIMEOUT` (per-component override, e.g. `{{ if eq .Name "kai-scheduler" }}`), and `NO_WAIT`/`ASYNC_COMPONENTS` (wait behavior)
- `helm_retry()` cleans up all non-succeeded hook Jobs (failed, pending, stuck) before each retry
- DRA rollout wait is no longer skipped under `--no-wait`
- New `validator_phases` input: training/conformance build only `conformance`, inference builds `deployment,conformance`, smoke builds none
- Increased `kind load` timeout (CUDA image)
- Evidence collection: `if: always()` (runs after any failure) → `if: always() && steps.bundle-install.outcome == 'success'`
- `--no-wait` docs now read "keeps `--timeout` for hooks"
- Step rename: Install GPU operator (bundle) → Install runtime bundle (reflects full stack deploy); step id `gpu-operator` → `bundle-install`
- Tests cover `helm_retry`, `cleanup_helm_hooks`, `HELM_TIMEOUT`, `NO_WAIT`, and the kai-scheduler 20m override

The `build_validators` input is kept as a deprecated fallback for out-of-tree callers. When `validator_phases` is set, it takes precedence.

Testing
GPU CI results across multiple runs:
H100 x2 runs at 90 min timed out during validation/chainsaw — bumped to 120 min to provide sufficient headroom. The deploy.sh fixes, selective validators, and per-component timeout are all validated end-to-end.
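The selective validator builds validated above follow the per-workflow mapping from the implementation notes. A sketch of that mapping (the function and the fallback branch are illustrative — the real logic sits behind the `validator_phases` input of the aicr-build action, and the full phase set in the default branch is an assumption):

```shell
#!/bin/sh
# Sketch of per-workflow validator selection (illustrative, not the action itself).
validator_phases() {
  case "$1" in
    training|conformance) echo "conformance" ;;            # one binary
    inference)            echo "deployment,conformance" ;;
    smoke)                echo "" ;;                        # skip validator builds
    *)                    echo "deployment,conformance,training" ;;  # assumed full set
  esac
}

validator_phases inference   # prints: deployment,conformance
```

When `validator_phases` is unset, the action falls back to the deprecated `build_validators` input.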
Risk Assessment
Rollout notes: Existing bundles will regenerate with the new per-component timeout, `helm_retry`, and DRA rollout behavior on the next `aicr bundle`. No migration needed. Out-of-tree callers of `aicr-build` using `build_validators: 'false'` continue to work via the deprecated fallback.

Checklist
- `make test` (with `-race`)
- `make lint`
- `git commit -S`