refactor(ci): switch sccache auth to IRSA web identity #8324

Merged
dmitry-tokarev-nv merged 3 commits into main from saperiyasamy/sccache-irsa-auth on Apr 17, 2026

Contributor

@sara4dev sara4dev commented Apr 17, 2026

Summary

  • Replace static AWS access key secrets with IRSA web identity token for sccache S3 authentication in BuildKit builds
  • Runner pods already have IRSA configured with sccache-read-write S3 permissions — pass the web identity token as a BuildKit secret file instead of static keys
  • Scrub AWS env vars in use-sccache.sh after sccache server starts (it holds credentials in-process)

Details

sccache's Rust AWS SDK natively supports AssumeRoleWithWebIdentity, which calls the STS HTTPS endpoint directly — accessible from the BuildKit sandbox unlike IMDS/Pod Identity Agent endpoints.

Environment visible to build tools now only contains AWS_ROLE_ARN (a public identifier) and AWS_WEB_IDENTITY_TOKEN_FILE (a file path) — both non-sensitive and scrubbed before ./configure runs.

Test plan

  • Trigger a container build on the aws-ci cluster and verify sccache hits S3
  • Verify config.log in built images contains no credentials
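The second test-plan item could be automated with a small check along these lines. This is a sketch, not the project's actual verification: the function name `config_log_is_clean` and the patterns are assumptions. It flags strings shaped like AWS access key IDs (an `AKIA` or `ASIA` prefix followed by 16 uppercase alphanumerics) and assignments of secret-bearing variables. `AWS_ROLE_ARN` is deliberately not flagged, since the PR treats it as non-sensitive.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 when the given file shows no credential-like
# strings, exit 1 when something suspicious is found, exit 2 if unreadable.
config_log_is_clean() {
    log_file="$1"
    [ -r "$log_file" ] || return 2
    # AKIA = long-term key ID prefix, ASIA = temporary key ID prefix.
    pattern='(AKIA|ASIA)[0-9A-Z]{16}|AWS_SECRET_ACCESS_KEY=|AWS_SESSION_TOKEN='
    if grep -E -q "$pattern" "$log_file"; then
        return 1
    fi
    return 0
}
```

A call such as `config_log_is_clean /path/to/config.log || echo "possible credential leak"` would then surface a failure in CI; the path is a placeholder.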

Made with Cursor

Summary by CodeRabbit

  • Chores
    • Migrated build pipeline authentication from static AWS credentials to IRSA (IAM Roles for Service Accounts) web identity tokens across all container build workflows.
    • Updated container build configuration to use token-based authentication instead of access key credentials.
    • Cleaned up build artifact removal in container build scripts.

Replace static AWS access key secrets with IRSA web identity token
for sccache S3 authentication in BuildKit builds. The runner pods
already have IRSA configured with the necessary S3 permissions, so
we pass the web identity token as a BuildKit secret file instead.

sccache's Rust AWS SDK natively supports AssumeRoleWithWebIdentity,
which calls the STS HTTPS endpoint directly (accessible from the
BuildKit sandbox, unlike IMDS).

Also scrub AWS env vars in use-sccache.sh after the sccache server
starts since it holds credentials in-process.

Made-with: Cursor

The eval'd output from setup-env concatenates fi and unset on one
line, causing a syntax error in /bin/sh. Add trailing semicolon.

Made-with: Cursor

No env var scrubbing needed — IRSA auth uses a mounted token file
and a non-sensitive role ARN, neither of which requires cleanup.

Made-with: Cursor
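
The syntax error described in the second commit message above (`fi` followed directly by `unset` on one line) can be reproduced in isolation. The strings below are illustrative only; the real setup-env output is not shown in this PR, and `DEMO_VAR` and `runs_in_sh` are hypothetical names.

```shell
#!/bin/sh
# A compound command must be terminated (";" or newline) before the next
# command starts; "fi unset ..." on one line is rejected by the parser.
broken='if [ -n "${DEMO_VAR:-}" ]; then :; fi unset DEMO_VAR'
fixed='if [ -n "${DEMO_VAR:-}" ]; then :; fi; unset DEMO_VAR'

# Run a snippet in a fresh /bin/sh and report success or failure via the
# exit status, discarding any diagnostics.
runs_in_sh() {
    sh -c "$1" >/dev/null 2>&1
}

runs_in_sh "$broken" || echo "broken snippet: rejected by sh"
runs_in_sh "$fixed" && echo "fixed snippet: accepted by sh"
```

The only difference between the two strings is the semicolon after `fi`, which matches the fix the commit describes.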
@sara4dev sara4dev marked this pull request as ready for review April 17, 2026 20:54
@sara4dev sara4dev requested review from a team as code owners April 17, 2026 20:54
@dmitry-tokarev-nv dmitry-tokarev-nv merged commit 8428c65 into main Apr 17, 2026
68 of 70 checks passed
@dmitry-tokarev-nv dmitry-tokarev-nv deleted the saperiyasamy/sccache-irsa-auth branch April 17, 2026 20:54
Contributor

coderabbitai Bot commented Apr 17, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d7e67783-81c3-45df-9a84-51bf71844450

📥 Commits

Reviewing files that changed from the base of the PR and between 32ec044 and 6e141ce.

📒 Files selected for processing (9)
  • .github/actions/build-flavor/action.yml
  • .github/actions/docker-build/action.yml
  • .github/actions/docker-remote-build/action.yml
  • .github/workflows/build-flavor.yml
  • .github/workflows/build-frontend-image.yaml
  • .github/workflows/build-test-distribute-flavor.yml
  • .github/workflows/container-validation-dynamo.yml
  • .github/workflows/shared-build-image.yml
  • container/templates/wheel_builder.Dockerfile

Walkthrough

This PR removes static AWS access key credentials (aws_access_key_id, aws_secret_access_key) from GitHub Actions and workflow configurations, replacing them with IRSA (IAM Roles for Service Accounts) web identity authentication. Secrets are now derived from AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN instead of being passed as explicit credential inputs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GitHub Actions: Credential Input Removal (.github/actions/build-flavor/action.yml, .github/actions/docker-build/action.yml, .github/actions/docker-remote-build/action.yml) | Removed aws_access_key_id and aws_secret_access_key inputs from composite actions. docker-remote-build now derives BuildKit secrets conditionally from IRSA (AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN) instead of static credentials; emits warning if IRSA values are unavailable. |
| GitHub Workflows: Credential Input Removal (.github/workflows/build-flavor.yml, .github/workflows/build-frontend-image.yaml, .github/workflows/build-test-distribute-flavor.yml, .github/workflows/container-validation-dynamo.yml, .github/workflows/shared-build-image.yml) | Removed passing aws_access_key_id and aws_secret_access_key from workflow steps to build actions; region and account ID inputs unchanged. |
| Dockerfile: IRSA Credential Mechanism (container/templates/wheel_builder.Dockerfile) | Replaced mounted secrets from aws-key-id/aws-secret-id with aws-web-identity-token mount and AWS_ROLE_ARN env var across multiple RUN blocks (FFmpeg, UCX, libfabric, AWS SDK C++, runtime wheel, nixl wheel, kvbm builds). Extended FFmpeg cleanup to remove config.status in addition to config.log. |
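
The conditional secret derivation summarized for docker-remote-build might look roughly like the sketch below. It is not the action's actual code: the function name `irsa_secret_args` and the warning text are assumptions, while the secret IDs `aws-web-identity-token` and `aws-role-arn` are the ones the Dockerfile in this PR mounts.

```shell
#!/bin/sh
# Hypothetical helper: print BuildKit --secret flags (one argument per line)
# when IRSA is available on the runner, otherwise print nothing and warn.
irsa_secret_args() {
    token_file="${AWS_WEB_IDENTITY_TOKEN_FILE:-}"
    role_arn="${AWS_ROLE_ARN:-}"
    if [ -n "$token_file" ] && [ -r "$token_file" ] && [ -n "$role_arn" ]; then
        # src= reads the secret from a file; env= reads it from an env var.
        printf '%s\n' \
            "--secret" "id=aws-web-identity-token,src=$token_file" \
            "--secret" "id=aws-role-arn,env=AWS_ROLE_ARN"
    else
        # "::warning::" is the GitHub Actions workflow command for warnings.
        echo "::warning::IRSA not available; building without sccache S3 access" >&2
    fi
}
```

A caller could splice the result into the build with `docker buildx build $(irsa_secret_args) .`, which relies on shell word splitting and so assumes the token path contains no whitespace.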

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.

🚩 Rust checks job still uses static AWS keys at runtime

In container-validation-dynamo.yml:182-183, the docker run command for Rust checks still passes -e AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} and -e AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} as runtime environment variables. This is a different use case (runtime sccache inside a running container, not build-time secrets), so it's outside the scope of this PR's build-time IRSA migration. However, if the long-term goal is to fully eliminate static AWS keys, this will also need to be migrated to IRSA or another mechanism.

Comment on lines +258 to +260
RUN --mount=type=secret,id=aws-web-identity-token,target=/run/secrets/aws-token \
--mount=type=secret,id=aws-role-arn,env=AWS_ROLE_ARN \
export AWS_WEB_IDENTITY_TOKEN_FILE=/run/secrets/aws-token && \

🚩 IRSA token TTL vs long Docker builds

IRSA web identity tokens have a finite TTL (typically 1-12 hours depending on configuration). The token file is snapshotted at docker buildx build invocation time via --secret id=aws-web-identity-token,src=${TOKEN_FILE}. For very long builds (which can exceed 60+ minutes per the timeout configs), the token may expire mid-build, causing sccache S3 writes to fail silently. The use-sccache.sh graceful fallback (container/use-sccache.sh:104) handles initial startup failures, but mid-build token expiration would manifest differently — sccache operations would start failing after the token expires. The old static keys didn't have this issue. In practice, this may be acceptable since sccache failures are non-fatal, but it could reduce cache hit rates for long builds.
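
One way to gauge the expiry risk described above is to read the token's `exp` claim before starting a build. The sketch below assumes the web identity token is a standard JWT (three dot-separated base64url segments) and that `base64 -d` is available as in GNU coreutils; the function name `token_seconds_remaining` is hypothetical and nothing like it is part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: print the seconds left before the JWT stored in the
# file "$1" expires (negative if already expired). Returns 1 if no "exp"
# claim can be read.
token_seconds_remaining() {
    # A JWT is header.payload.signature; the payload is base64url JSON.
    # Map the base64url alphabet back to standard base64.
    payload=$(cut -d. -f2 < "$1" | tr '_-' '/+')
    # Restore the '=' padding that base64url omits.
    case $(( ${#payload} % 4 )) in
        2) payload="${payload}==" ;;
        3) payload="${payload}=" ;;
    esac
    exp=$(printf '%s' "$payload" | base64 -d 2>/dev/null |
          sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    [ -n "$exp" ] || return 1
    echo $(( exp - $(date +%s) ))
}
```

A pre-build step could compare the result against the configured build timeout and emit a warning when the token would expire first. This only inspects the token snapshot; it does not refresh it.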

nv-yna pushed a commit to nv-yna/dynamo that referenced this pull request Apr 18, 2026
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
nv-yna pushed a commit to nv-yna/dynamo that referenced this pull request Apr 18, 2026
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
pvijayakrish added a commit that referenced this pull request Apr 19, 2026
…pick (#8338)

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: Saravana Periyasamy <saperiyasamy@nvidia.com>
Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>
Co-authored-by: Pavithra Vijayakrishnan <160681768+pvijayakrish@users.noreply.github.com>
indrajit96 pushed a commit that referenced this pull request Apr 20, 2026
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
nv-anants added a commit that referenced this pull request Apr 23, 2026
The five dynamo-* jobs (build, static-checks, test-parallel, test-sequential,
test-gpu) in pr.yaml, post-merge-ci.yml, and nightly-ci.yml are now one
dynamo-pipeline reusable workflow called once from each entry point. Matches
the vllm-pipeline / sglang-pipeline / trtllm-pipeline pattern and guarantees
the Actions UI collapses all inner jobs under a single "dynamo-runtime"
group.

Addresses PR 8525 review feedback:

- ranrubin: "duplicated flow in the three major workflows ... should be
  moved into a separated flow" — done via dynamo-pipeline.yml.
- ranrubin: "avoid running the deprecated builder instance for testing"
  — CPU pytest jobs now use prod-tester-amd-v1 instead of prod-builder-amd-v1.
- ranrubin: "we moved away from passing the secrets, see PR #8324" — the
  rust-gpu-checks docker run now uses IRSA (AWS_ROLE_ARN +
  AWS_WEB_IDENTITY_TOKEN_FILE bind-mount) instead of AWS_ACCESS_KEY_ID /
  AWS_SECRET_ACCESS_KEY. AWS access-key secrets dropped from pipeline.
- ranrubin: "ubuntu-slim cheaper" — dynamo-status-check switched.

Also folds shared-container-static-checks.yml inline into dynamo-pipeline
(it was only used for dynamo). Nightly sets no_cache: true to catch
regressions a cache-warm build would hide; build_timeout_minutes: 90 to
accommodate cold cache.

Signed-off-by: Anant Sharma <anants@nvidia.com>
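
The runtime IRSA hand-off mentioned in that commit message (AWS_ROLE_ARN plus a bind-mounted AWS_WEB_IDENTITY_TOKEN_FILE) could be assembled along these lines. This is a sketch under assumptions, not the pipeline's actual code: the function name `irsa_docker_run_args` and the in-container path `/var/run/secrets/irsa/token` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "docker run" flags (one argument per line) that
# forward the role ARN and bind-mount the token file read-only.
irsa_docker_run_args() {
    # Assumed in-container location for the token; any readable path works.
    container_token_path="/var/run/secrets/irsa/token"
    if [ -n "${AWS_ROLE_ARN:-}" ] && [ -r "${AWS_WEB_IDENTITY_TOKEN_FILE:-}" ]; then
        printf '%s\n' \
            "-e" "AWS_ROLE_ARN=$AWS_ROLE_ARN" \
            "-e" "AWS_WEB_IDENTITY_TOKEN_FILE=$container_token_path" \
            "-v" "$AWS_WEB_IDENTITY_TOKEN_FILE:$container_token_path:ro"
    fi
}
```

Used as `docker run $(irsa_docker_run_args) IMAGE COMMAND`, this passes no access keys on the command line. As with any word-split command substitution, it assumes the host token path contains no whitespace.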