
fix(ci): add failure diagnostics and fix Grafana resource starvation in Kind#563

Merged
mchmarny merged 2 commits into NVIDIA:main from yuanchen8911:fix/gpu-hook-diagnostics on Apr 14, 2026

Conversation

Contributor

@yuanchen8911 yuanchen8911 commented Apr 14, 2026

Summary

Add hook job/pod diagnostics in cleanup_helm_hooks(), restore Kind Grafana resources to base values, add Grafana failure diagnostics to GPU workflows, and bump inference test timeout from 60 to 90 minutes.

Motivation / Context

Two recurring GPU CI failures:

  1. Dynamo ssh-keygen hook failure (#558 inference test): The hook Job failed transiently and the retry logic from #539 cleaned up and recovered, but the failed pod was deleted before diagnostics were captured, making root-cause analysis impossible. The retry overhead (~20 min) also pushed the workflow past its 60-minute timeout.

  2. Grafana deployment never becomes Available (#535, #558 training tests): The Kind overlay reduced Grafana resources to 50m CPU / 64Mi memory, which was too low for Grafana to initialize. The deployment reported status: {} (no conditions) even after 5 minutes of chainsaw polling.

Fixes: N/A
Related: #558, #539, #541, #535

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: CI workflows, Kind overlay

Implementation Notes

  • Hook diagnostics: cleanup_helm_hooks() now runs kubectl describe job and kubectl describe pod on failed hook pods before deleting them. Raw container logs are intentionally omitted to avoid leaking sensitive bootstrap material (keys, certs, tokens) into CI output.
  • Grafana resources: Restored Kind overlay Grafana resources to base values (100m/128Mi requests, 500m/512Mi limits). The previous 50m/64Mi was insufficient for Grafana initialization.
  • Grafana diagnostics: Added failure-only kubectl get/describe for Grafana deployment and pods to training, inference, and conformance workflows.
  • Inference timeout: Bumped from 60→90 min to absorb retry overhead from transient hook failures.
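
The resource restore and the failure-only diagnostics step can be sketched roughly as follows. This is an illustrative sketch, not the PR's exact contents: the file path, namespace, label selector, and step name are assumptions.

```yaml
# Sketch only — the overlay path, namespace, and labels are assumed.
# Kind overlay (e.g. in recipes/overlays/kind.yaml): Grafana resources
# restored to the base values named in the PR description.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
---
# Workflow step: gated on failure(), so it produces zero output on green runs.
- name: Grafana failure diagnostics
  if: failure()
  run: |
    kubectl get deployment,pods -n monitoring -l app.kubernetes.io/name=grafana
    kubectl describe deployment grafana -n monitoring
    kubectl describe pods -n monitoring -l app.kubernetes.io/name=grafana
```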

Testing

go test ./pkg/bundler/deployer/helm/...  # PASS
yamllint -c .yamllint.yaml recipes/overlays/kind.yaml  # PASS
yamllint -c .yamllint.yaml .github/workflows/gpu-h100-training-test.yaml  # PASS
yamllint -c .yamllint.yaml .github/workflows/gpu-h100-inference-test.yaml  # PASS
yamllint -c .yamllint.yaml .github/workflows/gpu-h100-conformance-test.yaml  # PASS

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — diagnostics are additive, resource bump restores base values, timeout bump is safe to revert.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested review from a team as code owners April 14, 2026 01:46
@yuanchen8911 yuanchen8911 added enhancement New feature or request area/ci labels Apr 14, 2026
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 14, 2026 01:50
@yuanchen8911 yuanchen8911 force-pushed the fix/gpu-hook-diagnostics branch from a3defaa to 9a3452b Compare April 14, 2026 15:28
@yuanchen8911 yuanchen8911 changed the title fix(ci): add hook failure diagnostics and bump inference timeout fix(ci): add failure diagnostics and fix Grafana resource starvation in Kind Apr 14, 2026
mchmarny previously approved these changes Apr 14, 2026
Member

@mchmarny mchmarny left a comment

Clean, targeted fix for two real CI issues. Each change is well-scoped and easy to revert independently.

Looks good:

  • Hook diagnostics capture describe (not logs) before cleanup — avoids leaking sensitive bootstrap material while preserving events and pod conditions
  • Grafana resource bump from 50m/64Mi to 100m/128Mi requests restores values that actually let Grafana initialize
  • --skip-preflight is not relevant here (that's #561) — the diagnostics are purely additive
  • Grafana diagnostics gated on if: failure() — zero noise on green runs

Minor observations (non-blocking):

  • The Grafana diagnostic block is copy-pasted across 3 workflows — worth extracting to a composite action if more components get similar treatment
  • tail -30/tail -40 on describe output may truncate useful conditions if the events section is long — not a blocker, just something to revisit if diagnostics prove insufficient
  • The 60→90 min timeout bump is justified by retry overhead, but worth monitoring that normal runs don't creep close to that ceiling

Add job/pod describe output in cleanup_helm_hooks() before deleting
failed hook Jobs, so transient failures (e.g., dynamo ssh-keygen) are
diagnosable from CI logs. Bump GPU inference test timeout from 60 to
90 minutes to absorb retry overhead when hook retries succeed but
consume time budget.

Restore Kind overlay Grafana resources to base values (100m/128Mi
requests, 500m/512Mi limits) — the reduced 50m/64Mi was causing
Grafana to never become Available in Kind clusters. Add Grafana
deployment/pod diagnostics to training, inference, and conformance
workflow failure steps.
@yuanchen8911
Contributor Author

Addressed, thanks.

  • Increased both kubectl describe truncation limits to tail -50 so conditions/events are less likely to be pushed out by a long event section.
  • Updated hook pod diagnostics to iterate all pods for job-name=${name} instead of only the first pod, so we also capture retries if a hook Job creates multiple failed pods.
  • On the timeout bump: recent successful inference runs are still well below the new 90 min ceiling, so this remains a safety net for retry-heavy flakes rather than the expected runtime.
  • Agreed on the duplicated Grafana diagnostics. If this expands beyond the current three workflows, I'll extract it into a shared action or helper script.
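
The "iterate all pods for job-name=${name}" change above can be sketched as a small shell function. This is a demo sketch, not the repo's actual cleanup_helm_hooks(): the helper name, namespace, and Job name are made up, and kubectl is stubbed so the sketch runs without a cluster.

```shell
#!/usr/bin/env bash
# Stub standing in for the real kubectl, so the sketch is self-contained.
kubectl() {
  if [ "$1" = "get" ]; then
    # Stand-in for: kubectl get pods -n <ns> -l job-name=<name> -o name
    printf 'pod/demo-hook-abc12\npod/demo-hook-def34\n'
  else
    echo "[stub] kubectl $*"
  fi
}

# Hypothetical helper: describe a failed hook Job and ALL of its pods
# (not just the first) before deletion, truncating with tail -50 so a
# long events section is less likely to push out conditions.
dump_hook_diagnostics() {
  local ns="$1" name="$2"
  kubectl describe job "$name" -n "$ns" | tail -50
  for pod in $(kubectl get pods -n "$ns" -l "job-name=$name" -o name); do
    kubectl describe "$pod" -n "$ns" | tail -50
  done
}

out="$(dump_hook_diagnostics demo demo-hook)"
printf '%s\n' "$out"
```

Because the loop runs over every pod matching the job-name label, a hook Job whose retries created several failed pods yields one describe block per pod.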

@yuanchen8911 yuanchen8911 requested a review from mchmarny April 14, 2026 15:47
@mchmarny mchmarny enabled auto-merge (squash) April 14, 2026 15:52
@mchmarny mchmarny merged commit db9f3ab into NVIDIA:main Apr 14, 2026
37 checks passed