fix(ci): deduplicate conformance coverage in GPU CI by yuanchen8911 · Pull Request #577 · NVIDIA/aicr

yuanchen8911 · 2026-04-14T22:54:49Z

Summary

Deduplicate conformance coverage in GPU CI by removing the standalone H100 conformance workflow and moving the unique trigger coverage it carried into the remaining GPU training and inference workflows.

Motivation / Context

The standalone H100x2 conformance workflow duplicates the conformance coverage already exercised by the training and inference workflows. Deleting it reduces redundant GPU CI usage, and the trigger coverage that would otherwise be lost is consolidated into the two surviving workflows.

This PR is intentionally narrow in runtime behavior. It removes duplicate conformance execution without changing the training or inference workflow steps, and it leaves broader GPU CI symmetry work to a follow-up.

Fixes: #554
Issue: #554
Related: #541

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: GitHub Actions workflows

Implementation Notes

Deletes .github/workflows/gpu-h100-conformance-test.yaml.
Adds .github/actions/setup-build-tools/** to both remaining GPU workflow path filters because the deleted workflow was the only one carrying that trigger coverage.
Adds the shared tests/chainsaw/ai-conformance helper and the imported assert files, which the training workflow executes indirectly via kind-training/chainsaw-test.yaml, to the training workflow path filter, so deleting the standalone workflow does not create a trigger gap for training.
Leaves the training and inference workflow runtime behavior unchanged in this PR.
Defers broader GPU CI symmetry and path-filter cleanup to follow-up work.
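
For readers who have not opened the diff, the path-filter change described in the notes above might look roughly like the fragment below. This is a sketch, not the actual diff: the `pull_request` trigger and the exact glob for the shared Chainsaw helper are assumptions, and only `.github/actions/setup-build-tools/**` is quoted directly from the notes.

```yaml
# Sketch of the trigger block in a surviving GPU workflow (training shown).
# Assumption: the workflow is filtered on pull_request paths; the real
# workflows may also use push or other events.
on:
  pull_request:
    paths:
      # Added to both remaining GPU workflows by this PR (path taken from the notes).
      - '.github/actions/setup-build-tools/**'
      # Training only. The glob below is an assumed form for the shared helper
      # and assert files under tests/chainsaw/ai-conformance.
      - 'tests/chainsaw/ai-conformance/**'
```

With filters like these, a change to the shared build-tools action or to the shared conformance asserts still triggers a GPU run even though the standalone workflow is gone.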

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

make test ✅
make lint ✅
make qualify ❌ at make scan
make scan currently reports dependency vulnerabilities in the local environment:
- Go stdlib go1.26.1 CVEs fixed in 1.26.2 / 1.25.9
- pygments 2.19.2 fixed in 2.20.0

Risk Assessment

Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

yuanchen8911 · 2026-04-14T23:35:01Z

Follow-up proposal after this narrow #554 dedup PR:

The remaining GPU CI shape should be made symmetric so that training and inference both run the same core conformance coverage, with only platform-specific controller/gateway coverage and the inference smoke-test tail differing.

Proposed follow-up changes:

Training

Add recipes/overlays/h100-kind-training-kubeflow.yaml
Base it on h100-kind-training
Set platform: kubeflow
Add the platform-kubeflow mixin so the recipe includes kubeflow-trainer
Update the training workflow to pass platform: kubeflow
Expand kind training conformance checks to include:
- robust-controller
- secure-accelerator-access

Inference

Move the inference GPU workflow from H100 x1 to H100 x2
Raise min_gpu_count to 2
Raise timeout-minutes from 90 to 120
Add gang-scheduling to the kind inference recipes
Remove dead deployment-phase plumbing:
- validator_phases: 'deployment,conformance' -> validator_phases: 'conformance'
- --phase deployment --phase conformance -> --phase conformance
Keep conformance before the Dynamo smoke test
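
As a quick reference, the inference settings named above can be summarized as a before/after pair. This is a flat sketch only: the key names and values are taken from the list above, but where each key actually lives in the workflow (job level, action input, and so on) is not shown in this thread, so no nesting is implied.

```yaml
# Not a real workflow excerpt: a flat summary of the values named in the proposal.
---
# Current values, as stated in the proposal
timeout-minutes: 90
validator_phases: 'deployment,conformance'
---
# Proposed values
timeout-minutes: 120
min_gpu_count: 2
validator_phases: 'conformance'
```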

Chainsaw / assert coverage

Add Kubeflow Trainer chainsaw/assert coverage for the training workflow, since robust-controller should not be the only Kubeflow surface:
- trainer controller deployment
- validating webhook
- TrainJob CRD

Target steady state:

Shared conformance checks in both training and inference:
- platform-health
- gpu-operator-health
- dra-support
- accelerator-metrics
- ai-service-metrics
- gang-scheduling
- secure-accelerator-access
- pod-autoscaling
- cluster-autoscaling
- robust-controller
Inference additionally keeps inference-gateway
Inference additionally keeps the Dynamo smoke-test tail

That would remove the remaining accidental drift:

training currently does not exercise robust-controller
inference currently does not exercise gang-scheduling
inference still carries dead deployment-phase plumbing
training has no Kubeflow chainsaw/assert surface today

fix(ci): remove standalone GPU conformance workflow

24a13cd

yuanchen8911 requested a review from a team as a code owner April 14, 2026 22:54

yuanchen8911 added the enhancement New feature or request label Apr 14, 2026

github-actions bot added area/ci size/L labels Apr 14, 2026

lockwobr assigned lockwobr and yuanchen8911 and unassigned lockwobr Apr 14, 2026

yuanchen8911 changed the title ~~fix(ci): remove standalone GPU conformance workflow~~ fix(ci): align GPU CI by removing conformance job Apr 14, 2026

yuanchen8911 requested review from mchmarny and xdu31 April 15, 2026 00:37

yuanchen8911 changed the title ~~fix(ci): align GPU CI by removing conformance job~~ fix(ci): deduplicate conformance coverage in GPU CI Apr 15, 2026

This was referenced Apr 15, 2026

feat(ci): remove standalone conformance workflow #554

Closed

fix(ci): make GPU training and inference symmetric yuanchen8911/aicr#1

Closed

fix(ci): make GPU training and inference symmetric #579

Open

dims approved these changes Apr 15, 2026

yuanchen8911 merged commit f5f7387 into NVIDIA:main Apr 15, 2026
29 checks passed

yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:codex/issue-554-gpu-conformance-dedup

3 participants
