refactor(ci): switch sccache auth to IRSA web identity #8324
Replace static AWS access key secrets with IRSA web identity token for sccache S3 authentication in BuildKit builds. The runner pods already have IRSA configured with the necessary S3 permissions, so we pass the web identity token as a BuildKit secret file instead. sccache's Rust AWS SDK natively supports AssumeRoleWithWebIdentity, which calls the STS HTTPS endpoint directly (accessible from the BuildKit sandbox, unlike IMDS). Also scrub AWS env vars in use-sccache.sh after the sccache server starts since it holds credentials in-process. Made-with: Cursor
The eval'd output from setup-env concatenates fi and unset on one line, causing a syntax error in /bin/sh. Add trailing semicolon. Made-with: Cursor
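The failure described in this commit can be reproduced in isolation. The sketch below does not use the actual setup-env output, which is not shown here; the two one-line scripts are hypothetical stand-ins. The first places `unset` directly after `fi`, and the second adds the semicolon.

```shell
#!/bin/sh
# Hypothetical stand-ins for the eval'd text; the real setup-env output is not shown in this PR.
bad='if true; then :; fi unset FOO'
good='if true; then :; fi; unset FOO'

# Report whether sh can parse and run the given one-line script.
check() {
  if sh -c "$1" 2>/dev/null; then
    echo "ok"
  else
    echo "syntax error"
  fi
}

echo "without semicolon: $(check "$bad")"
echo "with semicolon:    $(check "$good")"
```

After `fi` the shell expects a separator or a redirection, so a bare word such as `unset` is rejected at parse time.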
No env var scrubbing needed — IRSA auth uses a mounted token file and a non-sensitive role ARN, neither of which requires cleanup. Made-with: Cursor
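As a rough illustration of the approach in these commits, the sketch below assembles a `docker buildx build` command that passes the IRSA token file and role ARN as BuildKit secrets. It prints the command rather than running it. The function name `irsa_buildx_cmd` is hypothetical, and passing the role ARN with `env=` is an assumption based on the `env=AWS_ROLE_ARN` mount in the reviewed Dockerfile; only the token `--secret` form is quoted in the review.

```shell
#!/bin/sh
# Sketch only: prints the buildx command instead of executing it.
# AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN are the variables IRSA injects into the pod.
irsa_buildx_cmd() {
  : "${AWS_WEB_IDENTITY_TOKEN_FILE:?IRSA token file variable is not set}"
  : "${AWS_ROLE_ARN:?IRSA role ARN variable is not set}"
  # Secret ids match the ids used by the Dockerfile's --mount=type=secret lines.
  # Passing the role ARN via env= is an assumption, not confirmed by the PR text.
  printf '%s ' docker buildx build \
    --secret "id=aws-web-identity-token,src=${AWS_WEB_IDENTITY_TOKEN_FILE}" \
    --secret "id=aws-role-arn,env=AWS_ROLE_ARN" \
    "$@"
  echo
}
```

Calling `irsa_buildx_cmd .` inside a runner pod would print the full command line for inspection; the secret values themselves never appear in it, only the token path and the variable name.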
Walkthrough
This PR removes static AWS access key credentials (…
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚩 Rust checks job still uses static AWS keys at runtime
In container-validation-dynamo.yml:182-183, the docker run command for Rust checks still passes -e AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} and -e AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} as runtime environment variables. This is a different use case (runtime sccache inside a running container, not build-time secrets), so it's outside the scope of this PR's build-time IRSA migration. However, if the long-term goal is to fully eliminate static AWS keys, this will also need to be migrated to IRSA or another mechanism.
    RUN --mount=type=secret,id=aws-web-identity-token,target=/run/secrets/aws-token \
        --mount=type=secret,id=aws-role-arn,env=AWS_ROLE_ARN \
        export AWS_WEB_IDENTITY_TOKEN_FILE=/run/secrets/aws-token && \
🚩 IRSA token TTL vs long Docker builds
IRSA web identity tokens have a finite TTL (typically 1-12 hours depending on configuration). The token file is snapshotted at docker buildx build invocation time via --secret id=aws-web-identity-token,src=${TOKEN_FILE}. For very long builds (the timeout configs allow more than 60 minutes), the token may expire mid-build, causing sccache S3 writes to fail silently. The use-sccache.sh graceful fallback (container/use-sccache.sh:104) handles initial startup failures, but mid-build token expiration would manifest differently: sccache operations would start failing once the token expires. The old static keys didn't have this issue. In practice, this may be acceptable since sccache failures are non-fatal, but it could reduce cache hit rates for long builds.
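One way to gauge the risk raised in this comment is to read the token's expiry before starting the build. The sketch below assumes the web identity token is a standard JWT whose payload carries an `exp` claim in Unix seconds, and that a `base64` supporting `-d` is available. The helper names `jwt_exp` and `token_seconds_left` are hypothetical and not part of the PR.

```shell
#!/bin/sh
# Print the "exp" claim (Unix seconds) from a JWT stored in the file "$1".
jwt_exp() {
  # The payload is the second dot-separated field, base64url-encoded.
  payload=$(cut -d. -f2 < "$1" | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d 2>/dev/null |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print how many seconds remain before the token in "$1" expires.
token_seconds_left() {
  exp=$(jwt_exp "$1")
  [ -n "$exp" ] || return 1
  echo $(( exp - $(date +%s) ))
}
```

A caller could compare `token_seconds_left "$AWS_WEB_IDENTITY_TOKEN_FILE"` against the job's build timeout and log a warning when the token would expire first.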
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…pick (#8338) Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com> Co-authored-by: Saravana Periyasamy <saperiyasamy@nvidia.com> Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com> Co-authored-by: Pavithra Vijayakrishnan <160681768+pvijayakrish@users.noreply.github.com>
Signed-off-by: Indrajit Bhosale <iamindrajitb@gmail.com>
The five dynamo-* jobs (build, static-checks, test-parallel, test-sequential, test-gpu) in pr.yaml, post-merge-ci.yml, and nightly-ci.yml are now one dynamo-pipeline reusable workflow called once from each entry point. Matches the vllm-pipeline / sglang-pipeline / trtllm-pipeline pattern and guarantees the Actions UI collapses all inner jobs under a single "dynamo-runtime" group.

Addresses PR 8525 review feedback:
- ranrubin: "duplicated flow in the three major workflows ... should be moved into a separated flow"; done via dynamo-pipeline.yml.
- ranrubin: "avoid running the deprecated builder instance for testing"; CPU pytest jobs now use prod-tester-amd-v1 instead of prod-builder-amd-v1.
- ranrubin: "we moved away from passing the secrets, see PR #8324"; the rust-gpu-checks docker run now uses IRSA (AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE bind-mount) instead of AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY. AWS access-key secrets dropped from pipeline.
- ranrubin: "ubuntu-slim cheaper"; dynamo-status-check switched.

Also folds shared-container-static-checks.yml inline into dynamo-pipeline (it was only used for dynamo). Nightly sets no_cache: true to catch regressions a cache-warm build would hide; build_timeout_minutes: 90 to accommodate cold cache.

Signed-off-by: Anant Sharma <anants@nvidia.com>
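The runtime variant mentioned for rust-gpu-checks could look roughly like the sketch below, which prints a `docker run` command that bind-mounts the token file read-only and forwards the two IRSA variables. The function name `irsa_docker_run_cmd` and the in-container path `/run/secrets/aws-token` are assumptions for illustration; the actual workflow step is not shown in this commit message.

```shell
#!/bin/sh
# Sketch only: prints the docker run command instead of executing it.
irsa_docker_run_cmd() {
  : "${AWS_WEB_IDENTITY_TOKEN_FILE:?IRSA token file variable is not set}"
  : "${AWS_ROLE_ARN:?IRSA role ARN variable is not set}"
  # Hypothetical in-container location for the bind-mounted token.
  container_token=/run/secrets/aws-token
  printf '%s ' docker run --rm \
    -v "${AWS_WEB_IDENTITY_TOKEN_FILE}:${container_token}:ro" \
    -e "AWS_WEB_IDENTITY_TOKEN_FILE=${container_token}" \
    -e "AWS_ROLE_ARN=${AWS_ROLE_ARN}" \
    "$@"
  echo
}
```

Unlike the static-key form, nothing secret is placed on the command line: the role ARN is an identifier and the token stays in the mounted file.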
Summary
- […] sccache-read-write S3 permissions: pass the web identity token as a BuildKit secret file instead of static keys
- […] use-sccache.sh after sccache server starts (it holds credentials in-process)

Details
sccache's Rust AWS SDK natively supports AssumeRoleWithWebIdentity, which calls the STS HTTPS endpoint directly, so it is accessible from the BuildKit sandbox, unlike IMDS/Pod Identity Agent endpoints.
Environment visible to build tools now only contains AWS_ROLE_ARN (a public identifier) and AWS_WEB_IDENTITY_TOKEN_FILE (a file path), both non-sensitive and scrubbed before ./configure runs.

Test plan
- […] aws-ci cluster and verify sccache hits S3
- […] config.log in built images contains no credentials

Made with Cursor
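The config.log item in the test plan could be automated with a small scan like the sketch below. It is an illustrative check, not the one used in this PR: the function name `no_aws_credentials` is hypothetical, and the pattern only looks for the secret-bearing variable names and for strings shaped like AWS access key IDs (the `AKIA` and `ASIA` prefixes followed by 16 characters).

```shell
#!/bin/sh
# Return 0 if the file "$1" shows no sign of AWS credentials, 1 if it does,
# and 2 if the file cannot be read.
no_aws_credentials() {
  [ -r "$1" ] || return 2
  # AKIA/ASIA are the prefixes of long-term and temporary access key IDs.
  ! grep -Eq 'AWS_SECRET_ACCESS_KEY|AWS_SESSION_TOKEN|(AKIA|ASIA)[A-Z0-9]{16}' "$1"
}
```

Run against a config.log copied out of a built image, a non-zero exit status would flag the image for inspection.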