
ci: cherry-pick #8324 (switch sccache auth to IRSA) to release/1.1.0 #8339

Closed
nv-yna wants to merge 1 commit into ai-dynamo:release/1.1.0 from nv-yna:yna/cp-8324-release-1.1.0


@nv-yna nv-yna commented Apr 18, 2026

Summary

Cherry-pick of #8324 (refactor(ci): switch sccache auth to IRSA web identity) from main to release/1.1.0. Single-commit, CI-only change — no runtime behavior change.

Why

PR #8324 merged to main at 2026-04-17T20:54Z and, as part of the switch from static AWS keys to IRSA web-identity auth for sccache S3, effectively revoked the old static keys. Since then, every PR targeting release/1.1.0 that triggers a container or Rust build fails with:

sccache: error: Server startup failed: cache storage failed to read:
  PermissionDenied (permanent) at read => S3Error {
    code: "InvalidAccessKeyId",
    message: "The AWS Access Key Id you provided does not exist in our records."
  }

release/1.0.2 already received this cherry-pick and is unblocked. release/1.1.0 is stuck without it.

Evidence that this cherry-pick fixes the issue

PR #8338 (a combined experimental PR with this cherry-pick + the DYN-2715 trtllm runtime fix from #8297) runs green on release/1.1.0 CI:

The only remaining failures on #8338 are a pre-existing broken markdown link (lychee 404 on lib/bench/src/bin/README.md, shared across other release/1.1.0 PRs, not introduced by any of these commits) and the PR-title lint on an earlier rename; neither is related to the sccache fix.

Diff: git diff main...origin/release/1.1.0 -- <affected files> produces only the two trivial comment/whitespace-style conflicts that were resolved in favor of the #8324 side:

  • .github/actions/docker-remote-build/action.yml: comment block describing the auth method (old: "AWS credentials"; new: "IRSA web identity token").
  • container/templates/wheel_builder.Dockerfile: one additional line that deletes config.log / config.status after ffmpeg build (find /tmp/ffmpeg-${FFMPEG_VERSION} \( -name config.log -o -name config.status \) -delete).

Test plan

Links



Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
@nv-yna nv-yna requested review from a team as code owners April 18, 2026 02:41
@nv-yna nv-yna requested a review from dillon-cullinan April 18, 2026 02:41
@github-actions github-actions Bot added the ci (Issues/PRs that reference CI build/test) label Apr 18, 2026
@github-actions

👋 Hi nv-yna! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added the external-contribution (Pull request is from an external contributor), container, and actions labels Apr 18, 2026
@nv-yna nv-yna requested a review from sara4dev April 18, 2026 02:41

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.


🚩 Rust-checks step still uses static AWS credentials at runtime

The rust-checks job in container-validation-dynamo.yml:182-183 still passes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables to docker run for sccache inside the running container. This is a runtime (not build-time) usage and is intentionally outside the scope of this PR. However, if the organization fully deprecates these static keys, this step will lose sccache caching silently (the use-sccache.sh script handles this gracefully via its if sccache --start-server guard). Worth tracking as a follow-up migration item.

(Refers to lines 182-183)
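
The graceful fallback described above can be sketched roughly as follows. This is a minimal illustration, not the contents of use-sccache.sh: the function name `setup_sccache` and the warning text are hypothetical, and the only pieces taken from the comment are the `sccache --start-server` guard and the idea that a failed start disables caching without failing the build. `RUSTC_WRAPPER` is the standard Cargo variable for a compiler wrapper.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a start-server guard; names are illustrative only.

# Enable sccache as the Rust compiler wrapper only if its server starts.
# Optional arguments replace the probe command (useful for dry runs).
setup_sccache() {
    # Default to the real server start command when no probe is given.
    [ "$#" -gt 0 ] || set -- sccache --start-server
    if "$@" >/dev/null 2>&1; then
        export RUSTC_WRAPPER=sccache
    else
        # Fall back to an uncached build, but leave a visible trace in the log.
        unset RUSTC_WRAPPER
        echo "sccache unavailable; building without a shared cache" >&2
    fi
}
```

Calling `setup_sccache` with no arguments probes the real server. Emitting a warning in the fallback branch is a design choice made here so that a revoked key shows up in the job log instead of only as slower builds.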


Comment on lines +157 to 170
+ # Pass IRSA web identity token as build secrets for sccache S3 access.
+ # The runner pod has IRSA which provides AWS_WEB_IDENTITY_TOKEN_FILE and
+ # AWS_ROLE_ARN. We pass the token file and role ARN to BuildKit so sccache
+ # can authenticate via STS AssumeRoleWithWebIdentity -- no static keys needed.
  SECRET_ARGS=""
- if [ "${{ inputs.use_sccache }}" == "true" ] && [ -n "${AWS_ACCESS_KEY_ID:-}" ]; then
-   SECRET_ARGS+=" --secret id=aws-key-id,env=AWS_ACCESS_KEY_ID"
-   SECRET_ARGS+=" --secret id=aws-secret-id,env=AWS_SECRET_ACCESS_KEY"
+ if [ "${{ inputs.use_sccache }}" == "true" ]; then
+   TOKEN_FILE="${AWS_WEB_IDENTITY_TOKEN_FILE:-}"
+   if [ -n "$TOKEN_FILE" ] && [ -f "$TOKEN_FILE" ] && [ -n "${AWS_ROLE_ARN:-}" ]; then
+     SECRET_ARGS+=" --secret id=aws-web-identity-token,src=${TOKEN_FILE}"
+     SECRET_ARGS+=" --secret id=aws-role-arn,env=AWS_ROLE_ARN"
+   else
+     echo "::warning::IRSA web identity token not available; sccache S3 cache will be disabled"
+   fi
  fi
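
For context, the consuming side of these two secrets might look something like the sketch below. It is an assumption-laden illustration, not code from this PR: the helper name `configure_sccache_irsa` is hypothetical, and it assumes the build step mounts both secrets at BuildKit's default location, `/run/secrets/<id>`, using the ids from the hunk above.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a build step that mounts both BuildKit secrets.
# Assumes the default mount path /run/secrets/<id>; pass another directory
# as the first argument to override it.
configure_sccache_irsa() {
    local secrets_dir="${1:-/run/secrets}"
    local token_file="${secrets_dir}/aws-web-identity-token"
    local role_file="${secrets_dir}/aws-role-arn"

    # Both secrets must be present and non-empty for web-identity auth.
    if [ -s "$token_file" ] && [ -s "$role_file" ]; then
        # Point the AWS credential chain at the mounted token and role.
        export AWS_WEB_IDENTITY_TOKEN_FILE="$token_file"
        AWS_ROLE_ARN="$(cat "$role_file")"
        export AWS_ROLE_ARN
        return 0
    fi

    echo "IRSA secrets not mounted; skipping sccache S3 setup" >&2
    return 1
}
```

A caller could gate the sccache setup on the return status, for example `configure_sccache_irsa && sccache --start-server`, so that a missing token leads to an uncached build.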

🚩 IRSA token expiration during long builds

The old approach used static IAM keys which never expire. IRSA web identity tokens have a limited lifetime (typically 1 hour, configurable up to 12 hours). The token content is captured at docker buildx build invocation time via --secret id=aws-web-identity-token,src=${TOKEN_FILE} and remains fixed throughout the build. For very long builds (some framework builds take 60+ minutes per build_timeout_minutes), the token could expire mid-build, causing sccache S3 writes to fail. The use-sccache.sh setup-env logic would have already started the server successfully, so failures would occur on individual cache operations rather than at startup. This is an inherent tradeoff of moving to short-lived credentials, and sccache should degrade gracefully (cache misses, not build failures).
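
One way to make this risk visible before a long build is to log how much lifetime the token has left. The sketch below is illustrative only and not part of this PR: the function name `jwt_seconds_left` is hypothetical, and it assumes the token file holds a standard three-part JWT with a numeric `exp` claim and that GNU `base64` and `sed` are available on the runner.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: report the remaining lifetime of a JWT.

# Print the seconds left before the token's "exp" claim (negative if expired).
# $1 is the raw token; $2 optionally overrides "now" as a Unix timestamp.
jwt_seconds_left() {
    local token="$1" now="${2:-$(date +%s)}"
    local payload exp

    # The payload is the second dot-separated field, base64url-encoded.
    payload="$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')"
    # Restore the padding that base64url encoding strips.
    while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done

    # Pull the numeric exp claim out of the decoded JSON.
    exp="$(printf '%s' "$payload" | base64 -d 2>/dev/null \
        | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')"
    [ -n "$exp" ] || return 1
    echo $(( exp - now ))
}
```

In the workflow step this could be called as `jwt_seconds_left "$(cat "$TOKEN_FILE")"` and compared against the job's build timeout to emit a warning. It only reads the claim; it does not verify the token's signature.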


@nv-yna nv-yna enabled auto-merge (squash) April 20, 2026 17:21
@nv-yna nv-yna closed this Apr 20, 2026
auto-merge was automatically disabled April 20, 2026 17:47

Pull request was closed


Labels

actions, ci, container, external-contribution, size/M


2 participants