fix(ci): add -mod=vendor to snapshot agent build #534

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:fix/ci-vendor-flag on Apr 10, 2026

@yuanchen8911 yuanchen8911 commented Apr 10, 2026

Summary

Add the missing GOFLAGS: -mod=vendor to the "Build snapshot agent" step in the aicr-build action, matching the other two build steps.

Motivation / Context

GPU CI tests (training, conformance) fail consistently on cold runners because go build in the snapshot agent step downloads all dependencies from the network instead of using vendored sources. This causes the build to hang silently until the 45-minute job timeout.

Fixes: N/A
Related: #290 (introduced the bug), #505 (similar CI build fix)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • Other: .github/actions/aicr-build/action.yml

Implementation Notes

The "Build snapshot agent" step was the only build step missing GOFLAGS: -mod=vendor. The "Build validator images" and "Build aicr binary" steps already had it. On warm runners (cached Go modules) the missing flag was harmless; on cold runners it caused go build to silently fetch all deps, exhausting the job timeout.
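
For illustration, the fixed step might look roughly like the sketch below. Only the step name and the GOFLAGS value come from this PR. The composite-action layout, the shell line, and the output and package paths in the run command are assumptions, not copied from the action file.

```yaml
# Hypothetical sketch of the fixed step in .github/actions/aicr-build/action.yml.
# Assumes a composite action; the build command and its paths are placeholders.
- name: Build snapshot agent
  shell: bash
  env:
    GOFLAGS: -mod=vendor   # resolve dependencies from vendor/ rather than the network
  run: |
    # Placeholder output and package paths, for illustration only.
    go build -o bin/snapshot-agent ./cmd/snapshot-agent
```

Setting GOFLAGS in the step's env, rather than passing -mod=vendor on the command line, applies the flag to every go invocation in that step.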

Note: The go run calls in the GPU workflow files (gpu-h100-conformance-test.yaml#L154, gpu-h100-training-test.yaml#L153, gpu-h100-inference-test.yaml#L270) also lack an explicit -mod=vendor, but these are not the same bug. With Go 1.26 and a vendor/ directory present, go run defaults to vendor mode automatically. Adding explicit flags there would be a consistency hardening follow-up, not a bug fix.
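
If that hardening follow-up were done, one possible shape is sketched below. The step name and package path are hypothetical placeholders and are not taken from the GPU workflow files. Only the GOFLAGS value mirrors this PR.

```yaml
# Hypothetical hardening of a go run step in a GPU workflow.
# Step name and package path are placeholders.
- name: Run test helper
  env:
    GOFLAGS: -mod=vendor   # make vendor mode explicit instead of relying on the default
  run: go run ./path/to/tool
```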

Testing

One-line CI-only change. Verified by inspecting the action file — the fix makes the snapshot agent step consistent with the other two build steps. Cherry-picked onto PR #523 to validate on GPU runners.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

The 'Build snapshot agent' step in aicr-build was missing GOFLAGS=-mod=vendor,
unlike the other two build steps. On cold CI runners without a Go module cache,
this caused 'go build' to silently download all dependencies from the network,
hanging until the job timeout.

Bug introduced in NVIDIA#290 (container-per-validator engine), latent until runners
lost their warm caches.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911
Contributor Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Verified that the "Build snapshot agent" step was the only one of three go build steps missing GOFLAGS: -mod=vendor. The fix makes it consistent with "Build validator images" (line 53) and "Build aicr binary" (line 80). Confirmed via git blame that the omission was introduced in #290.

🤖 Generated with Claude Code

@mchmarny merged commit 57e6b8d into NVIDIA:main on Apr 10, 2026
35 checks passed