ci: cherry-pick #8324 (switch sccache auth to IRSA) to release/1.1.0 #8339
nv-yna wants to merge 1 commit into
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
🚩 Rust-checks step still uses static AWS credentials at runtime
The rust-checks job in container-validation-dynamo.yml:182-183 still passes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables to docker run for sccache inside the running container. This is a runtime (not build-time) usage and is intentionally outside the scope of this PR. However, if the organization fully deprecates these static keys, this step will lose sccache caching silently (the use-sccache.sh script handles this gracefully via its if sccache --start-server guard). Worth tracking as a follow-up migration item.
(Refers to lines 182-183)
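The graceful-degradation behaviour described above can be sketched roughly as follows. This is not the contents of use-sccache.sh: the function name maybe_enable_sccache is hypothetical, and the sketch assumes that a failed `sccache --start-server` is the signal for unusable credentials. Only `sccache --start-server` and the `RUSTC_WRAPPER` variable are existing sccache/Cargo interfaces.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the real use-sccache.sh.
# Enables sccache only if the binary exists and its server starts;
# otherwise the build continues without a compiler cache.
maybe_enable_sccache() {
  if command -v sccache >/dev/null 2>&1 && sccache --start-server >/dev/null 2>&1; then
    # Cargo routes rustc invocations through this wrapper.
    export RUSTC_WRAPPER=sccache
    echo "sccache enabled"
  else
    # Assumption: a failed start means the cache backend is unusable.
    unset RUSTC_WRAPPER
    echo "sccache unavailable; building without cache" >&2
  fi
  # Never fail the build because caching is unavailable.
  return 0
}
```

Because the fallback only prints to stderr, a step that loses its credentials would still pass, which is why the follow-up needs explicit tracking.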
+ # Pass IRSA web identity token as build secrets for sccache S3 access.
+ # The runner pod has IRSA which provides AWS_WEB_IDENTITY_TOKEN_FILE and
+ # AWS_ROLE_ARN. We pass the token file and role ARN to BuildKit so sccache
+ # can authenticate via STS AssumeRoleWithWebIdentity -- no static keys needed.
  SECRET_ARGS=""
- if [ "${{ inputs.use_sccache }}" == "true" ] && [ -n "${AWS_ACCESS_KEY_ID:-}" ]; then
-   SECRET_ARGS+=" --secret id=aws-key-id,env=AWS_ACCESS_KEY_ID"
-   SECRET_ARGS+=" --secret id=aws-secret-id,env=AWS_SECRET_ACCESS_KEY"
+ if [ "${{ inputs.use_sccache }}" == "true" ]; then
+   TOKEN_FILE="${AWS_WEB_IDENTITY_TOKEN_FILE:-}"
+   if [ -n "$TOKEN_FILE" ] && [ -f "$TOKEN_FILE" ] && [ -n "${AWS_ROLE_ARN:-}" ]; then
+     SECRET_ARGS+=" --secret id=aws-web-identity-token,src=${TOKEN_FILE}"
+     SECRET_ARGS+=" --secret id=aws-role-arn,env=AWS_ROLE_ARN"
+   else
+     echo "::warning::IRSA web identity token not available; sccache S3 cache will be disabled"
+   fi
  fi
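For local experimentation, the same selection logic can be expressed as a standalone function. The sketch below is an illustration, not part of the PR: build_irsa_secret_args is a hypothetical helper name, and it collects the flags in a bash array rather than a string so that each flag stays one word even if the token path contains spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the action: fills the SECRET_ARGS array
# with BuildKit --secret flags when IRSA variables are present.
build_irsa_secret_args() {
  local token_file="${AWS_WEB_IDENTITY_TOKEN_FILE:-}"
  SECRET_ARGS=()
  if [ -n "$token_file" ] && [ -f "$token_file" ] && [ -n "${AWS_ROLE_ARN:-}" ]; then
    # The token is passed by file path, the role ARN by environment variable,
    # using the same secret ids as the action.
    SECRET_ARGS+=(--secret "id=aws-web-identity-token,src=${token_file}")
    SECRET_ARGS+=(--secret "id=aws-role-arn,env=AWS_ROLE_ARN")
  else
    echo "::warning::IRSA web identity token not available; sccache S3 cache will be disabled" >&2
  fi
}
```

A caller would then expand the array when invoking the build, for example `docker buildx build "${SECRET_ARGS[@]}" .`, with AWS_ROLE_ARN exported so that BuildKit can read it.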
🚩 IRSA token expiration during long builds
The old approach used static IAM keys, which never expire. IRSA credentials are short-lived: the STS session obtained through AssumeRoleWithWebIdentity lasts 1 hour by default (configurable up to 12 hours), and the projected web identity token has its own expiry. The token content is captured at docker buildx build invocation time via --secret id=aws-web-identity-token,src=${TOKEN_FILE} and remains fixed throughout the build. For very long builds (some framework builds take 60+ minutes per build_timeout_minutes), the token could expire mid-build, causing sccache S3 writes to fail. Because the use-sccache.sh setup-env logic would already have started the server successfully, failures would occur on individual cache operations rather than at startup. This is an inherent tradeoff of moving to short-lived credentials, and sccache should degrade gracefully (cache misses, not build failures).
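One way to make this risk visible before a build starts is to read the `exp` claim from the token and compare it with the build timeout. The sketch below assumes the token file holds a standard three-segment JWT with a numeric `exp` claim; token_seconds_left is a hypothetical helper, not something in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the seconds remaining until the JWT in
# $1 expires. An optional second argument overrides "now" (epoch seconds).
token_seconds_left() {
  local token_file="$1" now="${2:-$(date +%s)}"
  local payload exp
  # Take the second dot-separated segment and map base64url to base64.
  payload=$(cut -d. -f2 "$token_file" | tr -d '\n' | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding omits.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  # Extract the numeric exp claim from the decoded JSON payload.
  exp=$(printf '%s' "$payload" | base64 -d 2>/dev/null \
    | grep -o '"exp"[[:space:]]*:[[:space:]]*[0-9]*' \
    | grep -o '[0-9][0-9]*$')
  [ -n "$exp" ] || return 1
  echo $(( exp - now ))
}
```

A workflow step could compare the result with `build_timeout_minutes * 60` and emit a `::warning::` when the token would expire first. This only inspects the token; the lifetime of the STS session that sccache obtains from it is governed separately by the role's session settings.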
Summary
Cherry-pick of #8324 (refactor(ci): switch sccache auth to IRSA web identity) from main to release/1.1.0. Single-commit, CI-only change; no runtime behavior change.

Why
PR #8324 merged to main at 2026-04-17T20:54Z and, as part of the switch from static AWS keys to IRSA web-identity auth for sccache S3, effectively revoked the old static keys. Since then, every PR targeting release/1.1.0 that triggers a container or Rust build fails; the error observed on #8297 was InvalidAccessKeyId.

release/1.0.2 already received this cherry-pick and is unblocked. release/1.1.0 is stuck without it.

Evidence that this cherry-pick fixes the issue
PR #8338 (a combined experimental PR with this cherry-pick + the DYN-2715 trtllm runtime fix from #8297) runs green on release/1.1.0 CI:
- Build, Rust Checks, Pre Merge: previously failed on fix(trtllm): install pip into runtime venv for NVRTC JIT include discovery #8297 with InvalidAccessKeyId
- trtllm-runtime / Build multi-arch cuda13.1, trtllm-dev / Build multi-arch cuda13.1
- vllm-runtime, vllm-dev, sglang-runtime, sglang-dev, planner multi-arch builds: all green

Only remaining failures on #8338 are a pre-existing broken markdown link (lychee 404 on lib/bench/src/bin/README.md, shared across other release/1.1.0 PRs, not introduced by any of these commits) and the PR-title lint on an earlier rename; neither is related to the sccache fix.

Diff: git diff main...origin/release/1.1.0 -- <affected files> produces only the two trivial comment/whitespace-style conflicts that were resolved in favor of the #8324 side:
- .github/actions/docker-remote-build/action.yml: comment block describing the auth method (old: "AWS credentials"; new: "IRSA web identity token").
- container/templates/wheel_builder.Dockerfile: one additional line that deletes config.log/config.status after ffmpeg build (find /tmp/ffmpeg-${FFMPEG_VERSION} \( -name config.log -o -name config.status \) -delete).

Test plan
Links
main: refactor(ci): switch sccache auth to IRSA web identity #8324
release/1.1.0 PRs that touch container code.