[test] fix(trtllm): DYN-2715 on release/1.1.0 with #8324 IRSA cherry-pick #8338
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
…nstall

TRT-LLM's NVRTC JIT path (FMHA kernel compilation on Blackwell sm_100a) discovers its install location at runtime by shelling out to `pip show tensorrt_llm`. The runtime venv is built with `uv pip install`, which does not place `pip` inside the venv, so the subprocess resolves to the system `/usr/bin/pip` and cannot see uv-managed site-packages. `pip show` then returns "Package(s) not found" and TRT-LLM passes zero `-I` options to NVRTC, failing the FMHA JIT with:

    NVRTC_ERROR_COMPILATION ... could not open source file "cuda.h" (no directories in search list)

The failure only surfaces on Blackwell because sm_90 (Hopper) ships pre-compiled cubins and never invokes NVRTC.

Fixes DYN-2715.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
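The failure mode described in this commit message can be illustrated with a minimal Python sketch. This is not TRT-LLM's actual code: the function names and the `include/` subdirectory layout are hypothetical, chosen only to show how an empty `pip show` result turns into zero `-I` options.

```python
from typing import List, Optional


def parse_pip_show_location(pip_show_stdout: str) -> Optional[str]:
    """Return the 'Location:' value from `pip show` output, or None if absent."""
    for line in pip_show_stdout.splitlines():
        # Split at the first colon only, so paths containing ':' survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Location":
            return value.strip() or None
    return None


def nvrtc_include_flags(location: Optional[str],
                        package: str = "tensorrt_llm") -> List[str]:
    """Build -I options from an install location (hypothetical layout).

    When discovery failed (location is None), no include directories are
    produced, which matches the "no directories in search list" symptom.
    """
    if location is None:
        return []
    return [f"-I{location}/{package}/include"]
```

When the system `pip` cannot see the uv-managed site-packages, `pip show` prints nothing on stdout, `parse_pip_show_location("")` yields `None`, and the flag list is empty. Installing `pip` into the venv makes the lookup succeed without changing the discovery logic itself.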
🚩 Rust checks job still uses static AWS keys for sccache
In `container-validation-dynamo.yml:182-183`, the Rust checks job still passes `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` as environment variables to `docker run`. This is a different context from the Docker build migration (it's runtime sccache inside a test container, not build-time sccache), so IRSA tokens from the runner pod aren't automatically available inside the container. This is likely intentional, since migrating this would require mounting the IRSA token file into the container. However, if the org is planning to fully deprecate static AWS keys, this will need a separate migration.
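A rough sketch of what such a migration could look like, written as a Python helper that builds the extra `docker run` arguments. The helper name and the in-container token path are hypothetical. `AWS_WEB_IDENTITY_TOKEN_FILE` and `AWS_ROLE_ARN` are the standard variables IRSA sets on the pod, but whether sccache inside the test container would pick them up, and whether the Docker daemon can see the runner's token path, are assumptions that would need checking.

```python
from typing import List, Mapping


def irsa_docker_args(env: Mapping[str, str],
                     container_token_path: str = "/var/run/irsa/token") -> List[str]:
    """Build `docker run` arguments that forward IRSA web identity settings.

    `container_token_path` is a hypothetical mount point inside the container.
    """
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    role_arn = env.get("AWS_ROLE_ARN")
    if not token_file or not role_arn:
        raise ValueError("IRSA variables are not set on the runner")
    return [
        # Mount the projected token read-only instead of passing static keys.
        "-v", f"{token_file}:{container_token_path}:ro",
        "-e", f"AWS_WEB_IDENTITY_TOKEN_FILE={container_token_path}",
        "-e", f"AWS_ROLE_ARN={role_arn}",
    ]
```

The returned list would replace the two `-e AWS_ACCESS_KEY_ID` / `-e AWS_SECRET_ACCESS_KEY` arguments, for example via `irsa_docker_args(os.environ)` on the runner.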
Summary
Purpose: verify that the sccache IRSA CI fix unblocks DYN-2715 builds against `release/1.1.0`.

This branch contains two cherry-picks on top of `release/1.1.0`:

1. #8324: `refactor(ci): switch sccache auth to IRSA web identity` (cherry-picked from `main`, commit `8428c65f8a`). The static AWS access key auth was revoked at roughly `2026-04-17T20:54Z` (when #8324 merged to `main`). Since then, every `release/1.1.0` PR that triggers a container/Rust build fails with `S3Error { code: "InvalidAccessKeyId" }`. This commit switches sccache to IRSA web-identity (STS AssumeRoleWithWebIdentity).
2. DYN-2715 fix (same as #8297, `fix(trtllm): install pip into runtime venv for NVRTC JIT include discovery`): add `pip` to the `uv pip install` in `container/templates/trtllm_runtime.Dockerfile` so TRT-LLM's NVRTC JIT can locate its install via `pip show tensorrt_llm` (required for FMHA kernel JIT compilation on Blackwell sm_100a).

Why bundle them
#8297 reruns failed 3× in a row with identical `InvalidAccessKeyId` errors, so further reruns won't help. The same symptom hits PRs #8314, #8336, and every other recent `release/1.1.0` container build. `release/1.0.2` succeeds because it has the IRSA fix; `release/1.1.0` does not.

If this combined PR's CI passes, that confirms the IRSA cherry-pick is the right unblocker for `release/1.1.0`. Then the CI fix can be merged to `release/1.1.0` on its own, and #8297 becomes mergeable after a rebase.

Conflict resolution during cherry-pick of #8324
Two trivial conflicts, both resolved in favor of the `#8324` side:

- `.github/actions/docker-remote-build/action.yml`: only a comment block ("Pass AWS credentials ..." → "Pass IRSA web identity token ...")
- `container/templates/wheel_builder.Dockerfile`: one added line (`find /tmp/ffmpeg-${FFMPEG_VERSION} \( -name config.log -o -name config.status \) -delete`)

Risks
- … `release/1.1.0` rather than through this combined PR; bundling an unrelated CI fix into a feature fix is noisy and makes the history harder to read.

Test plan
- `trtllm-runtime / Build multi-arch cuda13.1`, `trtllm-dev / Build multi-arch cuda13.1`, and `Build` must pass (no sccache `InvalidAccessKeyId`).
- … `lib/bench/src/bin/README.md` is pre-existing on `release/1.1.0` (unrelated); expected to still fail.

Related: #8297, #8296, #8324.