fix: add trtllm build dependency for fault tolerance tests #6140
nv-tusharma wants to merge 4 commits into
- Add build-cuda13-amd64 to fault-tolerance-tests needs array
- Update BUILD_JOB_PATTERN to match CUDA13 builds for trtllm
- Aligns with pattern used by other test jobs in workflow

Relates to: OPS-3217
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
Walkthrough
The CI test suite workflow configuration is updated to add CUDA 13 build dependency support. The fault-tolerance-tests job now depends on the CUDA 13 build job, and the build job pattern matching logic is adjusted to handle the CUDA 13 suffix when identifying corresponding build jobs.
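The pattern adjustment described in the walkthrough can be sketched roughly as follows. This is only an illustration: the function name `build_job_pattern`, its arguments, and the `vllm` value in the usage note are hypothetical, and the workflow's actual BUILD_JOB_PATTERN expression is not shown on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the workflow's real code): pick the build job
# name that a test job for the given framework and architecture waits on.
build_job_pattern() {
  local framework="$1"
  local arch="$2"
  if [ "$framework" = "trtllm" ]; then
    # Assumption from this PR: trtllm images come from the CUDA 13 build job.
    echo "build-cuda13-${arch}"
  else
    # Every other framework is assumed to use the default build job.
    echo "build-${arch}"
  fi
}
```

Under these assumptions, `build_job_pattern trtllm amd64` yields `build-cuda13-amd64`, while a non-trtllm framework such as `vllm` yields `build-amd64`.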
Estimated code review effort: 2 (Simple), ~8 minutes
The fault-tolerance tests were failing because they tried to use a commit-SHA-based operator image that doesn't exist (nightly CI doesn't build the operator).

Changes:
- Add 'Determine operator image tag' step to check for nightly operator image
- Falls back to 'main-operator' (stable image) if nightly image not found
- Update helm install to use the determined tag instead of hardcoded SHA tag
- Remove redundant docker pull attempt

This mirrors the pattern used in pr.yaml for operator image selection.

Relates to: OPS-3217
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
Update: Additional Fix for Operator Image
After the first run, I discovered a second issue: the fault-tolerance tests were trying to use an operator image that doesn't exist.
Root Cause
The nightly CI doesn't build the operator image, but the fault-tolerance tests were trying to use a commit-SHA-based operator image tag.
Solution (commit 5cf0867)
Added operator image fallback logic (similar to pr.yaml):
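A minimal sketch of this kind of fallback, not the code from commit 5cf0867: the names `resolve_operator_tag` and `image_exists`, the `OPERATOR_IMAGE_REPO` variable, and the `nightly-operator` tag in the usage note are all hypothetical. Only the `main-operator` fallback tag comes from the commit message above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the candidate tag if the supplied check command
# succeeds for it, otherwise print the fallback tag.
# Usage: resolve_operator_tag <candidate> <fallback> <check-command...>
resolve_operator_tag() {
  local candidate="$1"
  local fallback="$2"
  shift 2
  if "$@" "$candidate" >/dev/null 2>&1; then
    echo "$candidate"
  else
    echo "$fallback"
  fi
}

# One possible check (an assumption, not the workflow's real step):
# ask the registry for the image manifest without pulling the image.
# OPERATOR_IMAGE_REPO is a placeholder such as "<registry>/<repository>".
image_exists() {
  docker manifest inspect "${OPERATOR_IMAGE_REPO}:$1"
}
```

A workflow step could then run `resolve_operator_tag nightly-operator main-operator image_exists` and pass the printed tag to helm. The check has to query the same registry that helm pulls from, otherwise the result does not reflect what the deployment will see.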
Testing
Triggered new nightly CI run to validate both fixes:
Run: https://github.com/ai-dynamo/dynamo/actions/runs/21962861746
The operator image check was querying ECR but helm pulls from ACR, causing a registry mismatch and deployment failures.

Changes:
- Update 'Determine operator image tag' to check Azure ACR
- This matches where helm actually pulls the operator image from

Follow-up: Extract operator deployment into reusable composite action (see GitHub Actions architect recommendations)

Relates to: OPS-3217
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
Update: Fixed Registry Mismatch
Issue #3: Registry Mismatch
The operator image check was querying ECR (AWS) but helm was pulling from ACR (Azure), causing deployment failures.
Fix (commit 547ebbb)
Testing
Triggered new nightly CI run: https://github.com/ai-dynamo/dynamo/actions/runs/$(gh run list --workflow=nightly-ci.yml --branch tusharma/fix-nightly-ci-fault-tolerance-tests --limit 1 --json databaseId -q '.[0].databaseId')
Follow-up
Created task #6 to refactor operator deployment into a reusable composite action (GitHub Actions best practice) after tests are working.
Tests were timing out at 60m. Increasing to 120m to allow tests to complete successfully.
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.
Summary
This PR fixes the fault tolerance tests in the nightly CI pipeline by adding the missing build dependency for TRT-LLM.
Changes
- Add build-cuda13-amd64 to fault-tolerance-tests needs array
- Update BUILD_JOB_PATTERN to include CUDA13 conditional for trtllm
Problem
The fault-tolerance tests were failing because:
build-cuda13-amd64, not build-amd64
Testing
This can be tested by triggering the nightly CI workflow on this branch:
Closes OPS-3217