[test] fix(trtllm): DYN-2715 on release/1.1.0 with #8324 IRSA cherry-pick #8338

Merged
pvijayakrish merged 3 commits into ai-dynamo:release/1.1.0 from nv-yna:yna/dyn-2715-release-1.1.0-with-ci-fix
Apr 19, 2026

nv-yna commented Apr 18, 2026

Summary

Purpose: verify that the sccache IRSA CI fix unblocks DYN-2715 builds against release/1.1.0.

This branch contains two cherry-picks on top of release/1.1.0:

  1. refactor(ci): switch sccache auth to IRSA web identity (#8324), cherry-picked from main, commit 8428c65f8a. The static AWS access key auth was revoked at roughly 2026-04-17T20:54Z, when #8324 merged to main. Since then, every release/1.1.0 PR that triggers a container/Rust build fails with S3Error { code: "InvalidAccessKeyId" }. This commit switches sccache to IRSA web-identity auth (STS AssumeRoleWithWebIdentity).

  2. DYN-2715 fix (same change as #8297, fix(trtllm): install pip into runtime venv for NVRTC JIT include discovery): add pip to the uv pip install in container/templates/trtllm_runtime.Dockerfile so TRT-LLM's NVRTC JIT can locate its install via pip show tensorrt_llm (required for FMHA kernel JIT compilation on Blackwell sm_100a).
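
To make the difference between the two auth modes in item 1 concrete, here is a minimal sketch that reports which AWS credential sources an environment exposes. It assumes the standard AWS SDK environment variable names (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY for static keys, AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE for web identity). The function name is hypothetical, and the sketch does not model how sccache or the CI workflow in #8324 actually selects credentials, nor any precedence between the two sources.

```python
import os
from typing import List, Mapping


def available_credential_sources(env: Mapping[str, str]) -> List[str]:
    """List the AWS credential sources visible in `env`.

    Hypothetical helper for illustration only; it inspects variable names
    and does not validate the credentials or contact AWS.
    """
    sources: List[str] = []
    # Static long-lived keys: the mode that was revoked.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("static-keys")
    # Web identity: a role ARN plus a path to a projected token file,
    # exchanged through STS AssumeRoleWithWebIdentity.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        sources.append("web-identity")
    return sources


if __name__ == "__main__":
    print(available_credential_sources(os.environ))
```

Running this inside a build step would show at a glance whether the job still depends on static keys.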

Why bundle them

Reruns of #8297 failed 3× in a row with identical InvalidAccessKeyId errors, so further reruns won't help. The same symptom hits PRs #8314, #8336, and every other recent release/1.1.0 container build. release/1.0.2 succeeds because it has the IRSA fix; release/1.1.0 does not.

If this combined PR's CI passes, that confirms the IRSA cherry-pick is the right unblocker for release/1.1.0. Then the CI fix can be merged to release/1.1.0 on its own, and #8297 becomes mergeable after a rebase.
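
The reasoning above (identical credential errors on every attempt mean a rerun cannot succeed) can be sketched as a small log check. This is an illustration only: the function names are hypothetical, and the pattern is based solely on the error string quoted in this PR, not on the full sccache log format.

```python
import re
from typing import List

# Matches the error text quoted in this PR: S3Error { code: "InvalidAccessKeyId" }
AUTH_ERROR = re.compile(r'S3Error\s*\{\s*code:\s*"InvalidAccessKeyId"')


def is_credential_failure(log_text: str) -> bool:
    """True if the log contains the revoked-key S3 error."""
    return AUTH_ERROR.search(log_text) is not None


def rerun_is_pointless(attempt_logs: List[str]) -> bool:
    """True when there is at least one attempt and every attempt failed on credentials."""
    return bool(attempt_logs) and all(is_credential_failure(t) for t in attempt_logs)
```

A deterministic credential error, unlike a flaky network failure, will recur until the auth configuration itself changes.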

Conflict resolution during cherry-pick of #8324

Two trivial conflicts, both resolved in favor of the #8324 side:

  • .github/actions/docker-remote-build/action.yml — only a comment block ("Pass AWS credentials ..." → "Pass IRSA web identity token ...")
  • container/templates/wheel_builder.Dockerfile — one added line (find /tmp/ffmpeg-${FFMPEG_VERSION} \( -name config.log -o -name config.status \) -delete)

Risks

  • If this experiment works, we should still merge the CI fix as its own PR to release/1.1.0 rather than through this combined PR — bundling an unrelated CI fix into a feature fix is noisy and makes the history harder to read.
  • If it doesn't work, the diagnosis needs revisiting.

Test plan

  • Cherry-picks apply with two trivial conflicts, both resolved.
  • DCO sign-off on both commits.
  • CI runs to completion — specifically trtllm-runtime / Build multi-arch cuda13.1, trtllm-dev / Build multi-arch cuda13.1, and Build must pass (no sccache InvalidAccessKeyId).
  • Lychee 404 on lib/bench/src/bin/README.md is pre-existing on release/1.1.0 — unrelated; expected to still fail.

Related: #8297, #8296, #8324.


sara4dev and others added 2 commits April 17, 2026 18:50
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…nstall

TRT-LLM's NVRTC JIT path (FMHA kernel compilation on Blackwell sm_100a)
discovers its install location at runtime by shelling out to
`pip show tensorrt_llm`. The runtime venv is built with `uv pip install`,
which does not place `pip` inside the venv, so the subprocess resolves
to the system `/usr/bin/pip` and cannot see uv-managed site-packages.
`pip show` then returns "Package(s) not found" and TRT-LLM passes zero
`-I` options to NVRTC, failing the FMHA JIT with:

  NVRTC_ERROR_COMPILATION ... could not open source file "cuda.h"
  (no directories in search list)

The failure only surfaces on Blackwell because sm_90 (Hopper) ships
pre-compiled cubins and never invokes NVRTC.

Fixes DYN-2715.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
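
The discovery mechanism described in the commit message can be sketched as follows. This is not TRT-LLM's actual code: the function names and the `include` subdirectory are assumptions made for illustration. The only detail relied on is that `pip show` prints a `Location:` line for an installed package and prints nothing on stdout when the package is not found.

```python
import os
import subprocess
import sys
from typing import List, Optional


def parse_location(pip_show_output: str) -> Optional[str]:
    """Extract the site-packages path from `pip show` output, if present."""
    for line in pip_show_output.splitlines():
        if line.startswith("Location:"):
            return line.split(":", 1)[1].strip()
    return None


def include_options(package: str, pip_cmd: List[str]) -> List[str]:
    """Build -I options from `<pip_cmd> show <package>`.

    Returns an empty list when pip cannot see the package, which mirrors
    the "zero -I options" failure described above.
    """
    result = subprocess.run(pip_cmd + ["show", package], capture_output=True, text=True)
    location = parse_location(result.stdout)
    if location is None:
        return []
    # Hypothetical layout: headers under <site-packages>/<package>/include.
    return ["-I" + os.path.join(location, package, "include")]


if __name__ == "__main__":
    # Uses the current interpreter's pip module so the sketch runs anywhere.
    print(include_options("tensorrt_llm", [sys.executable, "-m", "pip"]))
```

If the `pip` that gets resolved belongs to a different Python environment, `parse_location` returns `None` and the include list is empty, which is the situation the fix avoids by installing pip into the runtime venv.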
@github-actions

👋 Hi nv-yna! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

github-actions bot added the external-contribution, container, and actions labels on Apr 18, 2026

devin-ai-integration bot left a comment

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

🚩 Rust checks job still uses static AWS keys for sccache

In container-validation-dynamo.yml:182-183, the Rust checks job still passes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables to docker run. This is a different context from the Docker build migration (it's runtime sccache inside a test container, not build-time sccache), so IRSA tokens from the runner pod aren't automatically available inside the container. This is likely intentional — migrating this would require mounting the IRSA token file into the container. However, if the org is planning to fully deprecate static AWS keys, this will need a separate migration.
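
A rough sketch of what "mounting the IRSA token file into the container" could look like, expressed as a helper that builds extra `docker run` arguments. The function name and the in-container token path are hypothetical. It assumes the runner exposes the standard AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE variables and that the Docker daemon can bind-mount the runner's token file, which may not hold for a remote build daemon.

```python
from typing import List, Mapping


def irsa_docker_args(
    env: Mapping[str, str],
    container_token_path: str = "/var/run/secrets/irsa/token",  # hypothetical path
) -> List[str]:
    """Return `docker run` flags that forward web-identity credentials.

    Returns an empty list when the runner environment has no IRSA settings.
    """
    role_arn = env.get("AWS_ROLE_ARN")
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if not role_arn or not token_file:
        return []
    return [
        # Read-only bind mount of the projected token file.
        "-v", f"{token_file}:{container_token_path}:ro",
        # Point the SDK inside the container at the role and the mounted token.
        "-e", f"AWS_ROLE_ARN={role_arn}",
        "-e", f"AWS_WEB_IDENTITY_TOKEN_FILE={container_token_path}",
    ]
```

These flags would replace the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables currently passed to the test container, if the org decides to migrate this job as well.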

pvijayakrish merged commit f52010e into ai-dynamo:release/1.1.0 on Apr 19, 2026
14 of 16 checks passed
Labels

actions, container, external-contribution, size/L

3 participants