
fix(ci): move GPU concurrency to test jobs #581

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:codex/gpu-job-concurrency
Apr 15, 2026

Conversation

@yuanchen8911
Contributor

Summary

Move GPU workflow concurrency from the workflow level down to the heavyweight GPU job so the lightweight check-paths gate can always start and surface a visible check on PR updates.

Motivation / Context

GPU workflows on pull-request/* branches currently apply concurrency at the workflow level. When an older GPU run is still in progress, a newer workflow run can remain pending with jobs: [], which means the PR checks UI may not show the new training or inference run at all until the older run clears.

Moving the same concurrency group to the actual GPU job preserves the “latest run wins” behavior for expensive runner usage while still allowing check-paths to execute immediately and materialize a visible check.
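
For readers unfamiliar with the layout being replaced, the following is a rough sketch of a workflow with workflow-level concurrency. Only the concurrency key, the cancel-in-progress setting, the pull-request/* branch pattern, and the check-paths job name come from this description; the workflow name, the gpu-test job name, the output name, the runner labels, and the step contents are hypothetical placeholders.

```yaml
# Hypothetical "before" layout. Not copied from the repository.
name: GPU Smoke Test  # placeholder workflow name
on:
  push:
    branches:
      - "pull-request/*"
  workflow_dispatch:

# Workflow-level concurrency: the group applies to the entire run,
# so check-paths is held back together with the GPU job.
concurrency:
  group: ${{ github.workflow }}-${{ github.event_name }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  check-paths:
    runs-on: ubuntu-latest
    outputs:
      should_run: ${{ steps.filter.outputs.should_run }}  # hypothetical output
    steps:
      - id: filter
        run: echo "should_run=true" >> "$GITHUB_OUTPUT"  # placeholder diff check

  gpu-test:  # hypothetical job name
    needs: check-paths
    if: needs.check-paths.outputs.should_run == 'true'
    runs-on: [self-hosted, gpu]  # hypothetical runner labels
    steps:
      - run: echo "GPU tests would run here"
```

Because the concurrency block sits at the top level in this sketch, no job in a newer run, including the lightweight gate, can start until the group admits the run.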

Fixes: N/A
Related: #558

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: GPU CI workflows

Implementation Notes

  • Removes workflow-level concurrency from gpu-smoke-test.yaml, gpu-h100-inference-test.yaml, and gpu-h100-training-test.yaml.
  • Reattaches the same concurrency.group and cancel-in-progress: true settings to the heavyweight GPU test job in each workflow.
  • Keeps the existing event-aware concurrency key (${{ github.workflow }}-${{ github.event_name }}-${{ github.ref }}), so manual workflow_dispatch runs remain isolated from push-driven runs.
  • Leaves the check-paths gate unconstrained so it can start immediately, compute the PR diff, and surface a visible status even while an older GPU run is still tearing down.
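
The notes above could translate into a jobs section along these lines. This is an illustrative sketch rather than the actual diff: the gpu-test job name, the output name, the runner labels, and the step contents are assumptions, while the concurrency key, the cancel-in-progress setting, and the check-paths job name are taken from this description.

```yaml
# Hypothetical "after" layout: no top-level concurrency block.
jobs:
  check-paths:
    # No concurrency here, so this gate can start right away
    # and report a visible check on the PR.
    runs-on: ubuntu-latest
    outputs:
      should_run: ${{ steps.filter.outputs.should_run }}  # hypothetical output
    steps:
      - id: filter
        run: echo "should_run=true" >> "$GITHUB_OUTPUT"  # placeholder diff check

  gpu-test:  # hypothetical job name
    needs: check-paths
    if: needs.check-paths.outputs.should_run == 'true'
    runs-on: [self-hosted, gpu]  # hypothetical runner labels
    # Job-level concurrency: same key as before, now scoped to the
    # expensive job only, so a newer run still cancels an older one.
    concurrency:
      group: ${{ github.workflow }}-${{ github.event_name }}-${{ github.ref }}
      cancel-in-progress: true
    steps:
      - run: echo "GPU tests would run here"
```

Keeping github.event_name in the group key means a workflow_dispatch run and a push run on the same ref land in different groups and do not cancel each other.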

Testing

git diff --check
yamllint .github/workflows/gpu-h100-training-test.yaml \
  .github/workflows/gpu-h100-inference-test.yaml \
  .github/workflows/gpu-smoke-test.yaml
  • git diff --check passed.
  • yamllint passed on the three changed workflow files.
  • make lint could not complete cleanly in this sandbox because a prior local golangci-lint process left the global parallel lock in place (reported as "parallel golangci-lint is running").

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: This only changes when concurrency is applied, not the concurrency key itself. The expensive GPU jobs still cancel older same-branch runs; the check-paths gate is what becomes immediately runnable and visible.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 15, 2026 02:08
@yuanchen8911 yuanchen8911 force-pushed the codex/gpu-job-concurrency branch from a80c90f to 788d567 Compare April 15, 2026 02:24
@yuanchen8911 yuanchen8911 merged commit 1fb1695 into NVIDIA:main Apr 15, 2026
22 checks passed

3 participants