fix(ci): make GPU training and inference symmetric by yuanchen8911 · Pull Request #579 · NVIDIA/aicr

yuanchen8911 · 2026-04-15T00:49:45Z

Summary

Dependency: This PR depends on #577. It is opened against main like #577, so GitHub will show the #577 dedup changes in the diff until #577 merges.

Make the H100 training and inference GPU workflows symmetric, and fix the PR #579 E2E/tooling regression by pinning GoReleaser and aligning the repo toolchain to Go 1.26.2.

Motivation / Context

This is the follow-up to the GPU conformance dedup work in #577.
It removes the remaining drift between training and inference GPU CI so both workflows exercise the same core conformance checks, with only controller/gateway coverage and the inference smoke-test tail differing.

It also fixes the PR #579 E2E failure in which tools/setup-tools installed GoReleaser via go install ...@latest. That floating reference resolved to v2.15.3, a release that requires Go 1.26.2, while CI was still on Go 1.26.1.
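
To illustrate the difference between a floating and a pinned install, here is a minimal sketch. It is not the actual tools/setup-tools code: the helper name pinned_install_cmd and the module path example.com/hypothetical/tool are placeholders invented for this example, and the command is only printed, never executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the repo): builds a `go install` command for
# an exact version and refuses floating references such as "latest".
pinned_install_cmd() {
  local module="$1" version="$2"
  # Accept only an exact vMAJOR.MINOR.PATCH tag so the tool cannot drift.
  if [[ ! "$version" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    echo "refusing unpinned version: ${version}" >&2
    return 1
  fi
  echo "go install ${module}@${version}"
}

# Placeholder module path; the real path is elided in the PR description.
pinned_install_cmd "example.com/hypothetical/tool" "v2.15.3"
```

Rejecting anything that is not an exact tag means a new upstream release cannot silently raise the required Go version under CI.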

Fixes: N/A
Related: #554, #577

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: CI workflows, chainsaw coverage, and tool/version plumbing

Implementation Notes

Adds recipes/overlays/h100-kind-training-kubeflow.yaml and updates the training workflow to install platform: kubeflow.
Extends training conformance coverage with robust-controller and secure-accelerator-access, and adds Kubeflow chainsaw/assert coverage for the trainer controller, webhook, and TrainJob CRD.
Moves inference to H100 x2, adds gang-scheduling to the kind inference recipes, and removes dead deployment phase plumbing from the inference workflow.
Aligns timeout budgets and path filters so the remaining workflow differences are the deployed platform-specific components and the inference smoke-test tail.
Pins GoReleaser to v2.15.3 in .settings.yaml, threads that exact version through tools/setup-tools and all goreleaser-action call sites, and bumps the repo Go version references to 1.26.2.
Follow-up cleanup is expected separately: harden goreleaser_version inputs so callers cannot drift from .settings.yaml, update stale docs that still mention older GoReleaser versions, remove the orphaned tests/chainsaw/ai-conformance/kind-training/ directory, and consider pinning the macOS brew install goreleaser path in tools/setup-tools.

Testing

unset GITLAB_TOKEN
make qualify

make qualify passed test-coverage, lint, and e2e locally.
make qualify failed locally only at the scan step: grype dir:. picked up a stale ./aicr binary in the repo root that was built with go1.26.1. The freshly rebuilt dist/aicr_darwin_arm64_v8.0/aicr is built with go1.26.2.
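
A stale binary like this can be spotted before scanning with go version, which reports the toolchain a Go binary was built with. The sketch below is an assumption about how such a check might look, not something the PR adds; built_with is a hypothetical helper, and the check is skipped when ./aicr or the go command is absent.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the repo): succeeds when a line of
# `go version <binary>` output ("<path>: go<version>") names the expected toolchain.
built_with() {
  local line="$1" expected="$2"
  [[ "${line##*: }" == "go${expected}" ]]
}

# Check the repo-root binary only if it and the Go toolchain are present.
if [[ -f ./aicr ]] && command -v go >/dev/null 2>&1; then
  if ! built_with "$(go version ./aicr)" "1.26.2"; then
    echo "stale binary: $(go version ./aicr)"
  fi
fi
```

Removing or rebuilding the flagged binary before running the scan keeps grype from reporting findings against an outdated toolchain.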

Risk Assessment

Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout
Low — Isolated change, well-tested, easy to revert

Rollout notes: This changes the inference GPU job to require H100 x2, assumes the corresponding runner class remains available, and pins release/build tooling to GoReleaser v2.15.3. Merge this after #577.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

yuanchen8911 requested review from a team as code owners April 15, 2026 00:49

github-actions bot added area/recipes area/ci area/tests size/XL labels Apr 15, 2026

yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 61e2e09 to c78e996 Compare April 15, 2026 01:17

yuanchen8911 mentioned this pull request Apr 15, 2026

fix(ci): pin e2e goreleaser and exclude local build artifacts #580

Merged

25 tasks

yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from c78e996 to 4085ee2 Compare April 15, 2026 01:39

github-actions bot added size/L and removed size/XL labels Apr 15, 2026

fix(ci): make GPU training and inference symmetric

81349f8

yuanchen8911 force-pushed the codex/gpu-ci-symmetry branch from 4085ee2 to 81349f8 Compare April 15, 2026 02:24

yuanchen8911 wants to merge 1 commit into NVIDIA:main from yuanchen8911:codex/gpu-ci-symmetry