
fix(ci): deduplicate conformance coverage in GPU CI #577

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:codex/issue-554-gpu-conformance-dedup
Apr 15, 2026

@yuanchen8911
Contributor

@yuanchen8911 yuanchen8911 commented Apr 14, 2026

Summary

Deduplicate conformance coverage in GPU CI by removing the standalone H100 conformance workflow and preserving the unique trigger coverage it carried for the remaining GPU training and inference workflows.

Motivation / Context

The standalone H100x2 conformance workflow duplicates the conformance coverage already exercised by the training and inference workflows. Deleting it reduces redundant GPU CI usage, while consolidating the trigger coverage that would otherwise be lost into the two surviving workflows.

This PR is intentionally narrow in runtime behavior. It removes duplicate conformance execution without changing the training or inference workflow steps, and it leaves broader GPU CI symmetry work to a follow-up.

Fixes: #554
Issue: #554
Related: #541

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: GitHub Actions workflows

Implementation Notes

  • Deletes .github/workflows/gpu-h100-conformance-test.yaml.
  • Adds .github/actions/setup-build-tools/** to both remaining GPU workflow path filters because the deleted workflow was the only one carrying that trigger coverage.
  • Adds the shared tests/chainsaw/ai-conformance helper and imported assert files, which the training workflow executes indirectly via kind-training/chainsaw-test.yaml, to the training workflow's path filter, so deleting the standalone workflow does not create a trigger gap for training.
  • Leaves the training and inference workflow runtime behavior unchanged in this PR.
  • Defers broader GPU CI symmetry and path-filter cleanup to follow-up work.
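
As a rough illustration of the path-filter notes above, the trigger block of a remaining GPU workflow might look like the fragment below. This is a sketch, not the merged diff: the `on.pull_request.paths` layout and the `/**` glob on the chainsaw directory are assumptions. Only the `setup-build-tools` pattern and the `tests/chainsaw/ai-conformance` directory come from the notes, and the second entry would apply to the training workflow only.

```yaml
# Hypothetical trigger block for a remaining GPU workflow (sketch only).
on:
  pull_request:
    paths:
      # Trigger coverage previously carried only by the deleted
      # gpu-h100-conformance-test.yaml workflow.
      - '.github/actions/setup-build-tools/**'
      # Shared helper and imported assert files that the training workflow
      # runs indirectly through kind-training/chainsaw-test.yaml.
      # The glob form is an assumption; training workflow only.
      - 'tests/chainsaw/ai-conformance/**'
```

Listing the indirectly executed files in the filter means a change to a shared assert file still triggers the workflow that runs it.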

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify
  • make test
  • make lint
  • make qualify ❌ at make scan
  • make scan currently reports dependency vulnerabilities in the local environment:
    • Go stdlib go1.26.1 CVEs fixed in 1.26.2 / 1.25.9
    • pygments 2.19.2 fixed in 2.20.0

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 requested a review from a team as a code owner April 14, 2026 22:54
@yuanchen8911 added the enhancement label Apr 14, 2026
@lockwobr assigned lockwobr and yuanchen8911 and unassigned lockwobr Apr 14, 2026
@yuanchen8911
Contributor Author

Follow-up proposal after this narrow #554 dedup PR:

The remaining GPU CI shape should be made symmetric so that training and inference both run the same core conformance coverage, with only platform-specific controller/gateway coverage and the inference smoke-test tail differing.

Proposed follow-up changes:

  1. Training
  • Add recipes/overlays/h100-kind-training-kubeflow.yaml
  • Base it on h100-kind-training
  • Set platform: kubeflow
  • Add the platform-kubeflow mixin so the recipe includes kubeflow-trainer
  • Update the training workflow to pass platform: kubeflow
  • Expand kind training conformance checks to include:
    • robust-controller
    • secure-accelerator-access
  2. Inference
  • Move the inference GPU workflow from H100 x1 to H100 x2
  • Raise min_gpu_count to 2
  • Raise timeout-minutes from 90 to 120
  • Add gang-scheduling to the kind inference recipes
  • Remove dead deployment-phase plumbing:
    • validator_phases: 'deployment,conformance' -> validator_phases: 'conformance'
    • --phase deployment --phase conformance -> --phase conformance
  • Keep conformance before the Dynamo smoke test
  3. Chainsaw / assert coverage
  • Add Kubeflow Trainer chainsaw/assert coverage for the training workflow, since robust-controller alone should not be the only Kubeflow surface:
    • trainer controller deployment
    • validating webhook
    • TrainJob CRD

Target steady state:

  • Shared conformance checks in both training and inference:
    • platform-health
    • gpu-operator-health
    • dra-support
    • accelerator-metrics
    • ai-service-metrics
    • gang-scheduling
    • secure-accelerator-access
    • pod-autoscaling
    • cluster-autoscaling
    • robust-controller
  • Inference additionally keeps inference-gateway
  • Inference additionally keeps the Dynamo smoke-test tail

That would remove the remaining accidental drift:

  • training currently does not exercise robust-controller
  • inference currently does not exercise gang-scheduling
  • inference still carries dead deployment-phase plumbing
  • training has no Kubeflow chainsaw/assert surface today

@yuanchen8911 changed the title from "fix(ci): remove standalone GPU conformance workflow" to "fix(ci): align GPU CI by removing conformance job" Apr 14, 2026
@yuanchen8911 requested review from mchmarny and xdu31 April 15, 2026 00:37
@yuanchen8911 changed the title from "fix(ci): align GPU CI by removing conformance job" to "fix(ci): deduplicate conformance coverage in GPU CI" Apr 15, 2026
@yuanchen8911 merged commit f5f7387 into NVIDIA:main Apr 15, 2026
29 checks passed

Labels

area/ci, enhancement, size/L

Development

Successfully merging this pull request may close these issues.

feat(ci): remove standalone conformance workflow

3 participants