
fix(ci): add failure diagnostics and fix Grafana resource starvation in Kind#563

Merged
mchmarny merged 2 commits into NVIDIA:main from yuanchen8911:fix/gpu-hook-diagnostics on Apr 14, 2026

Conversation

Contributor

@yuanchen8911 yuanchen8911 commented Apr 14, 2026

Summary

Add hook job/pod diagnostics in cleanup_helm_hooks(), restore Kind Grafana resources to base values, add Grafana failure diagnostics to GPU workflows, and bump inference test timeout from 60 to 90 minutes.

Motivation / Context

Two recurring GPU CI failures:

  1. Dynamo ssh-keygen hook failure (#558 inference test): The hook Job failed transiently and the retry logic from #539 cleaned up and recovered, but the failed pod was deleted before diagnostics were captured, making root-cause analysis impossible. The retry overhead (~20 min) also pushed the workflow past its 60-minute timeout.

  2. Grafana deployment never becomes Available (#535, #558 training tests): The Kind overlay reduced Grafana resources to 50m CPU / 64Mi memory, which was too low for Grafana to initialize. The deployment reported status: {} (no conditions) even after 5 minutes of chainsaw polling.

Fixes: N/A
Related: #558, #539, #541, #535

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: CI workflows, Kind overlay

Implementation Notes

  • Hook diagnostics: cleanup_helm_hooks() now runs kubectl describe job and kubectl describe pod on failed hook pods before deleting them. Raw container logs are intentionally omitted to avoid leaking sensitive bootstrap material (keys, certs, tokens) into CI output.
  • Grafana resources: Restored Kind overlay Grafana resources to base values (100m/128Mi requests, 500m/512Mi limits). The previous 50m/64Mi was insufficient for Grafana initialization.
  • Grafana diagnostics: Added failure-only kubectl get/describe for Grafana deployment and pods to training, inference, and conformance workflows.
  • Inference timeout: Bumped from 60→90 min to absorb retry overhead from transient hook failures.
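
The resource restore and the failure-only diagnostics step can be sketched roughly as follows. This is an illustrative sketch, not the PR's exact contents: the file path, namespace, label selector, and step name are assumptions.

```yaml
# Sketch only — the overlay path, namespace, and labels are assumed.
# Kind overlay (e.g. in recipes/overlays/kind.yaml): Grafana resources
# restored to the base values named in the PR description.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
---
# Workflow step: gated on failure(), so it produces zero output on green runs.
- name: Grafana failure diagnostics
  if: failure()
  run: |
    kubectl get deployment,pods -n monitoring -l app.kubernetes.io/name=grafana
    kubectl describe deployment grafana -n monitoring
    kubectl describe pods -n monitoring -l app.kubernetes.io/name=grafana
```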

Testing

go test ./pkg/bundler/deployer/helm/...  # PASS
yamllint -c .yamllint.yaml recipes/overlays/kind.yaml  # PASS
yamllint -c .yamllint.yaml .github/workflows/gpu-h100-training-test.yaml  # PASS
yamllint -c .yamllint.yaml .github/workflows/gpu-h100-inference-test.yaml  # PASS
yamllint -c .yamllint.yaml .github/workflows/gpu-h100-conformance-test.yaml  # PASS

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — diagnostics are additive, resource bump restores base values, timeout bump is safe to revert.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested review from a team as code owners April 14, 2026 01:46
@yuanchen8911 yuanchen8911 added enhancement New feature or request area/ci labels Apr 14, 2026
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 14, 2026 01:50
@yuanchen8911 yuanchen8911 force-pushed the fix/gpu-hook-diagnostics branch from a3defaa to 9a3452b Compare April 14, 2026 15:28
@yuanchen8911 yuanchen8911 changed the title fix(ci): add hook failure diagnostics and bump inference timeout fix(ci): add failure diagnostics and fix Grafana resource starvation in Kind Apr 14, 2026
mchmarny previously approved these changes Apr 14, 2026
Member

@mchmarny mchmarny left a comment

Clean, targeted fix for two real CI issues. Each change is well-scoped and easy to revert independently.

Looks good:

  • Hook diagnostics capture describe (not logs) before cleanup — avoids leaking sensitive bootstrap material while preserving events and pod conditions
  • Grafana resource bump from 50m/64Mi to 100m/128Mi requests restores values that actually let Grafana initialize
  • --skip-preflight is not relevant here (that's #561) — the diagnostics are purely additive
  • Grafana diagnostics gated on if: failure() — zero noise on green runs

Minor observations (non-blocking):

  • The Grafana diagnostic block is copy-pasted across 3 workflows — worth extracting to a composite action if more components get similar treatment
  • tail -30/tail -40 on describe output may truncate useful conditions if the events section is long — not a blocker, just something to revisit if diagnostics prove insufficient
  • The 60→90 min timeout bump is justified by retry overhead, but worth monitoring that normal runs don't creep close to that ceiling

Add job/pod describe output in cleanup_helm_hooks() before deleting
failed hook Jobs, so transient failures (e.g., dynamo ssh-keygen) are
diagnosable from CI logs. Bump GPU inference test timeout from 60 to
90 minutes to absorb retry overhead when hook retries succeed but
consume time budget.

Restore Kind overlay Grafana resources to base values (100m/128Mi
requests, 500m/512Mi limits) — the reduced 50m/64Mi was causing
Grafana to never become Available in Kind clusters. Add Grafana
deployment/pod diagnostics to training, inference, and conformance
workflow failure steps.
@yuanchen8911
Contributor Author

Addressed, thanks.

  • Increased both kubectl describe truncation limits to tail -50 so conditions/events are less likely to be pushed out by a long event section.
  • Updated hook pod diagnostics to iterate all pods for job-name=${name} instead of only the first pod, so we also capture retries if a hook Job creates multiple failed pods.
  • On the timeout bump: recent successful inference runs are still well below the new 90 min ceiling, so this remains a safety net for retry-heavy flakes rather than the expected runtime.
  • Agreed on the duplicated Grafana diagnostics. If this expands beyond the current three workflows, I'll extract it into a shared action or helper script.
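
The "iterate all pods for job-name=${name}" change above can be sketched as a small shell function. This is a demo sketch, not the repo's actual cleanup_helm_hooks(): the helper name, namespace, and Job name are made up, and kubectl is stubbed so the sketch runs without a cluster.

```shell
#!/usr/bin/env bash
# Stub standing in for the real kubectl, so the sketch is self-contained.
kubectl() {
  if [ "$1" = "get" ]; then
    # Stand-in for: kubectl get pods -n <ns> -l job-name=<name> -o name
    printf 'pod/demo-hook-abc12\npod/demo-hook-def34\n'
  else
    echo "[stub] kubectl $*"
  fi
}

# Hypothetical helper: describe a failed hook Job and ALL of its pods
# (not just the first) before deletion, truncating with tail -50 so a
# long events section is less likely to push out conditions.
dump_hook_diagnostics() {
  local ns="$1" name="$2"
  kubectl describe job "$name" -n "$ns" | tail -50
  for pod in $(kubectl get pods -n "$ns" -l "job-name=$name" -o name); do
    kubectl describe "$pod" -n "$ns" | tail -50
  done
}

out="$(dump_hook_diagnostics demo demo-hook)"
printf '%s\n' "$out"
```

Because the loop runs over every pod matching the job-name label, a hook Job whose retries created several failed pods yields one describe block per pod.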

@yuanchen8911 yuanchen8911 requested a review from mchmarny April 14, 2026 15:47
@mchmarny mchmarny enabled auto-merge (squash) April 14, 2026 15:52
@mchmarny mchmarny merged commit db9f3ab into NVIDIA:main Apr 14, 2026
37 checks passed