fix(ci): make GPU training and inference symmetric #579

Open
yuanchen8911 wants to merge 1 commit into NVIDIA:main from yuanchen8911:codex/gpu-ci-symmetry

yuanchen8911 (Contributor) commented Apr 15, 2026

Summary

Dependency: This PR depends on #577. It is opened against main like #577, so GitHub will show the #577 dedup changes in the diff until #577 merges.

Make the H100 training and inference GPU workflows symmetric, and fix the PR #579 E2E/tooling regression by pinning GoReleaser and aligning the repo toolchain to Go 1.26.2.

Motivation / Context

This is the follow-up to the GPU conformance dedup work in #577.
It removes the remaining drift between training and inference GPU CI so both workflows exercise the same core conformance checks, with only controller/gateway coverage and the inference smoke-test tail differing.

It also fixes the PR #579 E2E failure where tools/setup-tools installed GoReleaser via go install ...@latest, which drifted to v2.15.3 and required Go 1.26.2 while CI was still on Go 1.26.1.
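To illustrate the drift, here is a minimal sketch of a toolchain check that would have flagged the mismatch before `go install` ran. The function name `go_version_at_least` is hypothetical and is not part of `tools/setup-tools`; it only handles plain numeric `MAJOR.MINOR.PATCH` versions, not release candidates.

```shell
# Hypothetical helper (not from this repo): succeed when Go version $1
# is at least the required version $2. Accepts an optional "go" prefix.
go_version_at_least() {
  have=${1#go}
  want=${2#go}
  for i in 1 2 3; do
    h=${have%%.*}
    w=${want%%.*}
    [ "$h" -gt "$w" ] && return 0
    [ "$h" -lt "$w" ] && return 1
    # Components are equal: drop the leading one, or use 0 if none remain.
    case $have in *.*) have=${have#*.} ;; *) have=0 ;; esac
    case $want in *.*) want=${want#*.} ;; *) want=0 ;; esac
  done
  return 0
}

# The situation described above: CI on Go 1.26.1, tool requires Go 1.26.2.
if ! go_version_at_least 1.26.1 1.26.2; then
  echo "Go 1.26.1 is older than the required 1.26.2"
fi
```

A check like this only reports the problem; the actual fix in this PR is pinning the tool version and bumping the toolchain.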

Fixes: N/A
Related: #554, #577

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: CI workflows, chainsaw coverage, and tool/version plumbing

Implementation Notes

  • Adds recipes/overlays/h100-kind-training-kubeflow.yaml and updates the training workflow to install platform: kubeflow.
  • Extends training conformance coverage with robust-controller and secure-accelerator-access, and adds Kubeflow chainsaw/assert coverage for the trainer controller, webhook, and TrainJob CRD.
  • Moves inference to H100 x2, adds gang-scheduling to the kind inference recipes, and removes dead deployment phase plumbing from the inference workflow.
  • Aligns timeout budgets and path filters so the remaining workflow differences are the deployed platform-specific components and the inference smoke-test tail.
  • Pins GoReleaser to v2.15.3 in .settings.yaml, threads that exact version through tools/setup-tools and all goreleaser-action call sites, and bumps the repo Go version references to 1.26.2.
  • Follow-up cleanup is expected separately: harden goreleaser_version inputs so callers cannot drift from .settings.yaml, update stale docs that still mention older GoReleaser versions, remove the orphaned tests/chainsaw/ai-conformance/kind-training/ directory, and consider pinning the macOS brew install goreleaser path in tools/setup-tools.
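As a rough sketch of the pinning idea in the GoReleaser bullet, the helper below builds a `go install` target from an exact version and refuses `latest`. The function name `goreleaser_install_target` is hypothetical, how the version is read from `.settings.yaml` is left out because that file's layout is not shown here, and the `/v2` module path assumes a v2.x release.

```shell
# Hypothetical helper (not from this repo): print the `go install` target
# for a pinned GoReleaser version. Rejects "latest", empty input, and
# anything that does not look like a vMAJOR.MINOR.PATCH tag.
goreleaser_install_target() {
  version=$1
  case $version in
    v[0-9]*.[0-9]*.[0-9]*) ;;
    *)
      echo "refusing unpinned GoReleaser version: '$version'" >&2
      return 1
      ;;
  esac
  # Assumption: a v2.x release, so the module path carries the /v2 suffix.
  echo "github.com/goreleaser/goreleaser/v2@$version"
}

# Intended use (not executed here):
#   go install "$(goreleaser_install_target v2.15.3)"
goreleaser_install_target v2.15.3
```

Keeping the version as a single input makes it easier for every call site to take the same value instead of resolving `@latest` independently.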

Testing

unset GITLAB_TOKEN
make qualify
  • make qualify passed test-coverage, lint, and e2e locally.
  • make qualify failed only at scan locally because grype dir:. picked up a stale repo-root ./aicr binary built with go1.26.1; the freshly rebuilt dist/aicr_darwin_arm64_v8.0/aicr is go1.26.2.
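For anyone hitting the same scan failure, a stale binary can be spotted by comparing the toolchain each binary reports. `go version <file>` prints one line per file in the form `<path>: go<version>`; the sketch below parses such lines. The helper names `built_with` and `is_stale_build` are hypothetical, and the two input lines are written by hand from the versions stated above rather than captured from a real run.

```shell
# Hypothetical helper: extract the toolchain version from one line of
# `go version <file>` output, which has the form "<path>: go<version>".
built_with() {
  echo "${1##*: }"
}

# Succeed when the binary's recorded toolchain differs from the expected one.
is_stale_build() {
  [ "$(built_with "$1")" != "$2" ]
}

# Hand-written sample lines matching the two binaries described above.
for line in "./aicr: go1.26.1" "dist/aicr_darwin_arm64_v8.0/aicr: go1.26.2"; do
  if is_stale_build "$line" go1.26.2; then
    echo "stale: $line"
  fi
done
```

Deleting or rebuilding the flagged binary before running the scan should keep the directory scan from picking up the old toolchain.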

Risk Assessment

  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout
  • Low — Isolated change, well-tested, easy to revert

Rollout notes: This changes the inference GPU job to require H100 x2, assumes the corresponding runner class remains available, and pins release/build tooling to GoReleaser v2.15.3. Merge this after #577.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 4085ee2 to 81349f8 on April 15, 2026 02:24