fix(ci): add failure diagnostics and fix Grafana resource starvation in Kind#563
Merged

mchmarny merged 2 commits into NVIDIA:main (Apr 14, 2026)
Conversation
Force-pushed from a3defaa to 9a3452b
mchmarny (Member) previously approved these changes on Apr 14, 2026:
Clean, targeted fix for two real CI issues. Each change is well-scoped and easy to revert independently.
Looks good:
- Hook diagnostics capture `describe` (not logs) before cleanup — avoids leaking sensitive bootstrap material while preserving events and pod conditions
- Grafana resource bump from `50m/64Mi` to `100m/128Mi` requests restores values that actually let Grafana initialize
- `--skip-preflight` is not relevant here (that's #561) — the diagnostics are purely additive
- Grafana diagnostics gated on `if: failure()` — zero noise on green runs
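A failure-gated diagnostics step of the kind described above might look roughly like this; the step name, namespace, and label selector are assumptions, not taken from the PR:

```yaml
- name: Grafana diagnostics
  if: failure()  # runs only when an earlier step failed, so green runs stay quiet
  run: |
    kubectl -n monitoring get deploy,pods -l app.kubernetes.io/name=grafana -o wide
    kubectl -n monitoring describe deploy grafana | tail -40
    kubectl -n monitoring describe pods -l app.kubernetes.io/name=grafana | tail -30
```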
Minor observations (non-blocking):
- The Grafana diagnostic block is copy-pasted across 3 workflows — worth extracting to a composite action if more components get similar treatment
- `tail -30`/`tail -40` on `describe` output may truncate useful conditions if the events section is long — not a blocker, just something to revisit if diagnostics prove insufficient
- The 60→90 min timeout bump is justified by retry overhead, but worth monitoring that normal runs don't creep close to that ceiling
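The composite-action extraction suggested in the first observation could be sketched as follows; the action path, input name, and selectors are assumptions, not taken from the PR:

```yaml
# .github/actions/grafana-diagnostics/action.yaml (hypothetical location)
name: grafana-diagnostics
description: Dump Grafana deployment/pod state for failed CI runs
inputs:
  namespace:
    description: Namespace where Grafana runs (assumed input)
    default: monitoring
runs:
  using: composite
  steps:
    - shell: bash
      run: |
        kubectl -n "${{ inputs.namespace }}" get deploy,pods -l app.kubernetes.io/name=grafana -o wide
        kubectl -n "${{ inputs.namespace }}" describe deploy -l app.kubernetes.io/name=grafana | tail -40
```

Each workflow would then replace its inline block with a `uses: ./.github/actions/grafana-diagnostics` step guarded by `if: failure()`.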
- Add job/pod `describe` output in `cleanup_helm_hooks()` before deleting failed hook Jobs, so transient failures (e.g., dynamo ssh-keygen) are diagnosable from CI logs.
- Bump GPU inference test timeout from 60 to 90 minutes to absorb retry overhead when hook retries succeed but consume time budget.
- Restore Kind overlay Grafana resources to base values (`100m/128Mi` requests, `500m/512Mi` limits) — the reduced `50m/64Mi` was causing Grafana to never become Available in Kind clusters.
- Add Grafana deployment/pod diagnostics to training, inference, and conformance workflow failure steps.
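The describe-before-delete behavior can be sketched as a shell function. Only the function name and the describe-then-delete ordering come from the PR; the job discovery, label handling, and namespace argument here are assumptions:

```shell
#!/usr/bin/env bash
# Sketch of the diagnostics capture added to cleanup_helm_hooks().
cleanup_helm_hooks() {
  local ns="$1"
  local job failed
  for job in $(kubectl -n "$ns" get jobs -o name); do
    # Skip Jobs that report no failed pods.
    failed=$(kubectl -n "$ns" get "$job" -o jsonpath='{.status.failed}')
    if [ -z "$failed" ]; then continue; fi
    # Capture diagnostics BEFORE deleting; raw container logs are
    # intentionally not dumped, to keep bootstrap secrets out of CI output.
    kubectl -n "$ns" describe "$job" | tail -40
    kubectl -n "$ns" describe pods -l "job-name=${job#job.batch/}" | tail -30
    kubectl -n "$ns" delete "$job"
  done
}
```

The `job-name` label is set by the Job controller on every pod it creates, which makes it a reliable selector for the hook pods even after retries.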
Force-pushed from e389cfd to f6a5713
Contributor (Author): Addressed, thanks.
mchmarny approved these changes on Apr 14, 2026
Summary
Add hook job/pod diagnostics in `cleanup_helm_hooks()`, restore Kind Grafana resources to base values, add Grafana failure diagnostics to GPU workflows, and bump the inference test timeout from 60 to 90 minutes.

Motivation / Context
Two recurring GPU CI failures:

1. Dynamo ssh-keygen hook failure (#558, "fix(ci): replace push path filters with runtime path gate in GPU workflows", inference test): the hook Job failed transiently and the retry logic (#539, "fix(ci): improve GPU test reliability and deploy timeout handling") cleaned up and recovered, but the failed pod was deleted before diagnostics were captured, making root-cause analysis impossible. The retry overhead (~20 min) also pushed the workflow past its 60-minute timeout.
2. Grafana deployment never Available (#535, "fix(ci): auto-label new issues by area and assign owners"; #558 training tests): the Kind overlay reduced Grafana resources to `50m` CPU / `64Mi` memory, which was too low for Grafana to initialize. The deployment reported `status: {}` (no conditions) even after 5 minutes of chainsaw polling.

Fixes: N/A
Related: #558, #539, #541, #535
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`

Implementation Notes
- `cleanup_helm_hooks()` now runs `kubectl describe job` and `kubectl describe pod` on failed hook pods before deleting them. Raw container logs are intentionally omitted to avoid leaking sensitive bootstrap material (keys, certs, tokens) into CI output.
- Kind overlay Grafana resources are restored to base values (`100m/128Mi` requests, `500m/512Mi` limits). The previous `50m/64Mi` was insufficient for Grafana initialization.
- Added `kubectl get`/`describe` for the Grafana deployment and pods to training, inference, and conformance workflows.

Testing
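The restored resource block corresponds to something like the following Kustomize overlay patch; the file path, deployment name, and container name are assumptions, while the resource values are the ones stated above:

```yaml
# overlays/kind/grafana-resources-patch.yaml (hypothetical path)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  template:
    spec:
      containers:
        - name: grafana
          resources:
            requests:
              cpu: 100m      # base value restored (was 50m in the broken overlay)
              memory: 128Mi  # base value restored (was 64Mi)
            limits:
              cpu: 500m
              memory: 512Mi
```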
Risk Assessment
Rollout notes: N/A — diagnostics are additive, resource bump restores base values, timeout bump is safe to revert.
Checklist
- `make test` (with `-race`)
- `make lint`
- `git commit -S` — GPG signing info