fix(ci): improve GPU test reliability and deploy timeout handling #539

mchmarny merged 2 commits into NVIDIA:main from
Conversation
Addressed both items:

- `--no-wait` semantic change: Added to PR description. This is intentional — without `--timeout`, Helm hooks fall back to a 5-min default, which is insufficient for cold runners pulling hook images (observed on the dynamo-platform ssh-keygen hook).
- README template + CLI reference updated: `README.md.tmpl:53` and `docs/user/cli-reference.md:1158` now describe `--no-wait` as "keeps `--timeout` for hooks" instead of the previous generic description.
- Preserve `--timeout` for Helm hooks with `--no-wait` in deploy.sh
- Clean up failed Helm hook Jobs before retry — uses the `helm.sh/hook` annotation and `status.failed` to target only failed hook Jobs
- Preserve `--timeout 10m` for async components (e.g., kai-scheduler) so hook Jobs like `crd-upgrader` get an adequate timeout on cold runners
- Increase `kind load` timeout from 300s to 600s for the CUDA-based smoke-test image; use 120s for tiny validator images
- Add `build_validators` input to the aicr-build action; skip validator builds in the smoke test to fit within the 30m job timeout
- Gate evidence collection on bundle install success so it runs after validation/chainsaw failures but skips after infrastructure failures (build, cluster setup)
- Bump training and inference workflow timeouts from 45m to 60m
- Rename "Install GPU operator (bundle)" to "Install runtime bundle" to reflect that it deploys the full stack, not just the GPU operator
- Update the `--no-wait` description in the README template and CLI reference
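The failed-hook cleanup in the second bullet can be sketched as follows — a minimal illustration only, not the actual deploy.sh. `kubectl` is mocked here as a function emitting one row per Job (`name`, `status.failed` count, `helm.sh/hook` annotation); the real script would query the cluster and delete the matching Jobs:

```shell
#!/bin/sh
# Sketch of pre-retry hook-Job cleanup (illustration, not the actual deploy.sh).
# Mocked kubectl: each row is "<job-name> <status.failed> <helm.sh/hook annotation>",
# with "-" meaning the Job carries no hook annotation (i.e., not a Helm hook Job).
kubectl_mock() {
  cat <<'EOF'
crd-upgrader 1 pre-install
ssh-keygen 0 pre-install
worker-job 1 -
EOF
}

# Print only Jobs that are Helm hook Jobs AND have failed pods; the real
# script would run `kubectl delete job` on each name instead of echoing it.
cleanup_helm_hooks() {
  kubectl_mock | while read -r name failed hook; do
    if [ "$hook" != "-" ] && [ "$failed" -gt 0 ]; then
      echo "$name"
    fi
  done
}

cleanup_helm_hooks   # prints: crd-upgrader
```

Only `crd-upgrader` qualifies here: `ssh-keygen` has no failed pods, and `worker-job` is not a hook Job, so ordinary workloads are never touched.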
mchmarny
left a comment
Solid PR — the eight issues are well-diagnosed and the fixes are clean. One real suggestion (jq for annotation parsing), rest are nits/future-proofing. Tests cover the new pieces. LGTM with minor comments.
Addressed all review comments:
@mchmarny ready for re-review when you get a chance.
/lgtm |
Summary
Fix deploy.sh hook timeout handling, hook Job cleanup on retry, and multiple GPU CI reliability issues (kind load timeouts, evidence gating, workflow timeouts, selective validator builds, per-component Helm timeouts).
Motivation / Context
Eight related GPU CI reliability issues:

1. Helm hook timeout: The `dynamo-platform` chart has a pre-install hook Job (ssh-keygen) that must complete before install. When `--no-wait` cleared both `--wait` and `--timeout`, Helm fell back to its 5-minute default — insufficient on cold runners where the hook image must be pulled.
2. Failed/stuck hook Jobs block retries: When a Helm hook Job (e.g., kai-scheduler `crd-upgrader`) times out, it stays in the namespace and blocks subsequent `helm upgrade --install` retries with "Job not ready" errors. This includes both failed Jobs and Jobs stuck in Pending (where Helm's `--timeout` expired before the pod started). Observed on the conformance test for PR #535 (fix(ci): auto-label new issues by area and assign owners).
3. Async components lost hook timeout: `ASYNC_COMPONENTS` (kai-scheduler) cleared `COMPONENT_WAIT_ARGS` entirely, removing `--timeout` along with `--wait`. Hook Jobs on async components fell back to Helm's 5-minute default.
4. Kind image load timeout: `kind load docker-image` had a 300s timeout per attempt. On slow runners, loading the ~250MB CUDA base image to a 2-node cluster exceeded this — both attempts timed out (exit code 124).
5. Evidence collection after failure: Collect AI conformance evidence used `if: always()`, causing it to run (and fail) even when earlier steps like Build aicr failed.
6. DRA rollout skipped under `--no-wait`: The DRA kubelet plugin restart + rollout-status is a correctness gate (kubelet must re-register the plugin socket), not a readiness convenience. Since CI uses `--no-wait` via `gpu-operator-install`, the rollout wait was being skipped on the exact path that H100 validation depends on.
7. H100 x2 workflow timeout: Training and conformance jobs on `linux-amd64-gpu-h100-latest-2` runners consistently exceed the original timeout due to slow Docker/registry I/O during Build aicr (17-36 min vs 1.5 min on H100 x1) and slow in-cluster image pulls during bundle install (the kai-scheduler hook alone takes 10-20+ min). Build aicr was also compiling all 3 validator binaries when only conformance is needed. At 90 min, the pre-validation path still consumed the entire budget on slow runners; bumped to 120 min to provide sufficient headroom.
8. kai-scheduler hook needs a longer timeout: The kai-scheduler `crd-upgrader` pre-install hook pulls its container image from ghcr.io inside the kind cluster. On H100 x2 runners with degraded I/O, this frequently exceeds the global 10m Helm timeout, triggering unnecessary retry cycles (10m timeout → cleanup → retry). A 20m per-component timeout avoids the wasted first attempt while keeping other components at 10m. Validated: both H100 x2 runs passed kai-scheduler on first attempt with 20m.

Fixes: N/A
Related: #534
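The timeout-preservation fix behind items 1, 3, and 8 amounts to decoupling `--timeout` from `--wait`. A minimal sketch, assuming the fix works roughly as described (the function name and variable defaults here are illustrative, not the actual deploy.sh):

```shell
#!/bin/sh
# Sketch: keep --timeout even when waiting is disabled (illustrative names).
HELM_TIMEOUT="${HELM_TIMEOUT:-10m}"   # global default
NO_WAIT="${NO_WAIT:-false}"

helm_wait_args() {
  timeout="$HELM_TIMEOUT"
  # Per-component override for the slow kai-scheduler crd-upgrader hook.
  if [ "$1" = "kai-scheduler" ]; then timeout="20m"; fi
  if [ "$NO_WAIT" = "true" ]; then
    # Old behavior cleared both flags, so hook Jobs fell back to Helm's
    # 5-minute default; new behavior skips --wait but keeps --timeout.
    echo "--timeout ${timeout}"
  else
    echo "--wait --timeout ${timeout}"
  fi
}

NO_WAIT=true
helm_wait_args kai-scheduler   # prints: --timeout 20m
```

The point is that hook Jobs honor `--timeout` regardless of `--wait`, so dropping only `--wait` keeps the hook budget intact on cold runners.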
Type of Change
Component(s) Affected
- `pkg/bundler`, `pkg/component/*`
- `.github/actions/aicr-build/action.yml`, `.github/workflows/gpu-h100-{conformance,training,inference}-test.yaml`

Implementation Notes
- `--no-wait` hook timeout: `WAIT_ARGS=""` (5-min default) → `--timeout ${COMPONENT_HELM_TIMEOUT}`; likewise `COMPONENT_WAIT_ARGS=""` (5-min default) → `--timeout ${COMPONENT_HELM_TIMEOUT}` (skip `--wait`, keep timeout)
- The single `WAIT_ARGS` coupling timeout + wait behavior is split into `HELM_TIMEOUT` (global default), `COMPONENT_HELM_TIMEOUT` (per-component override, e.g. `{{ if eq .Name "kai-scheduler" }}`), and `NO_WAIT`/`ASYNC_COMPONENTS` (wait behavior)
- `helm_retry()` cleans up all non-succeeded hook Jobs (failed, pending, stuck) before each retry
- DRA rollout wait is no longer skipped under `--no-wait`
- New `validator_phases` input: training/conformance build only `conformance`, inference builds `deployment,conformance`, smoke builds none
- Increased `kind load` timeout (CUDA image)
- Evidence collection: `if: always()` (runs after any failure) → `if: always() && steps.bundle-install.outcome == 'success'`
- `--no-wait` docs now read "keeps `--timeout` for hooks"
- Step rename: Install GPU operator (bundle) → Install runtime bundle (reflects full stack deploy); step id `gpu-operator` → `bundle-install`
- Tests cover `helm_retry`, `cleanup_helm_hooks`, `HELM_TIMEOUT`, `NO_WAIT`, and the kai-scheduler 20m override

The `build_validators` input is kept as a deprecated fallback for out-of-tree callers. When `validator_phases` is set, it takes precedence.

Testing
GPU CI results across multiple runs:
H100 x2 runs at 90 min timed out during validation/chainsaw — bumped to 120 min to provide sufficient headroom. The deploy.sh fixes, selective validators, and per-component timeout are all validated end-to-end.
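The selective validator builds validated above follow the per-workflow mapping from the implementation notes. A sketch of that mapping (the function and the fallback branch are illustrative — the real logic sits behind the `validator_phases` input of the aicr-build action, and the full phase set in the default branch is an assumption):

```shell
#!/bin/sh
# Sketch of per-workflow validator selection (illustrative, not the action itself).
validator_phases() {
  case "$1" in
    training|conformance) echo "conformance" ;;            # one binary
    inference)            echo "deployment,conformance" ;;
    smoke)                echo "" ;;                        # skip validator builds
    *)                    echo "deployment,conformance,training" ;;  # assumed full set
  esac
}

validator_phases inference   # prints: deployment,conformance
```

When `validator_phases` is unset, the action falls back to the deprecated `build_validators` input.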
Risk Assessment
Rollout notes: Existing bundles will regenerate with the new per-component timeout, `helm_retry`, and DRA rollout behavior on the next `aicr bundle`. No migration needed. Out-of-tree callers of `aicr-build` using `build_validators: 'false'` continue to work via the deprecated fallback.

Checklist
- `make test` (with `-race`)
- `make lint`
- `git commit -S`