fix: add trtllm build dependency for fault tolerance tests#6140

Closed
nv-tusharma wants to merge 4 commits into main from tusharma/fix-nightly-ci-fault-tolerance-tests

@nv-tusharma nv-tusharma commented Feb 10, 2026

Summary

This PR fixes the fault tolerance tests in the nightly CI pipeline by adding the missing build dependency for TRT-LLM.

Changes

  • Added build-cuda13-amd64 to the fault-tolerance-tests needs array
  • Updated BUILD_JOB_PATTERN to include a CUDA13 conditional for trtllm
  • Aligned both with the pattern used by other test jobs (unit-tests, integration-tests, etc.)

Problem

The fault-tolerance tests were failing because:

  1. TRT-LLM is only built in build-cuda13-amd64, not build-amd64
  2. The build check pattern didn't account for the "CUDA13" suffix in TRT-LLM build job names
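
The second cause can be sketched in shell. This is a minimal illustration, not the workflow's actual code: the function names build_job_pattern and has_build_job, the "Build <framework>" job-name format, and the example job names are all assumptions. Only the CUDA13 suffix for trtllm comes from the description above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the pattern used to locate a framework's build job.
build_job_pattern() {
  local framework="$1"
  if [ "$framework" = "trtllm" ]; then
    # TRT-LLM is built only by the CUDA 13 job, so its job name carries a suffix.
    printf 'Build %s CUDA13\n' "$framework"
  else
    printf 'Build %s\n' "$framework"
  fi
}

# Hypothetical check: reads job names on stdin and succeeds if one of them
# contains the pattern for the given framework.
has_build_job() {
  grep -qF -- "$(build_job_pattern "$1")"
}
```

With a pattern that lacks the suffix, a lookup for trtllm finds no matching build job even when the CUDA 13 build succeeded, which is the failure mode described in item 2.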

Testing

This can be tested by triggering the nightly CI workflow on this branch:

gh workflow run nightly-ci.yml --ref tusharma/fix-nightly-ci-fault-tolerance-tests

Closes OPS-3217

Summary by CodeRabbit

  • Chores
    • Expanded continuous integration pipeline to support CUDA 13 testing, enabling validation across additional GPU configurations for enhanced platform compatibility.
    • Improved test workflow dependencies and execution sequencing to ensure test suites run with correct build artifacts in proper order.

- Add build-cuda13-amd64 to fault-tolerance-tests needs array
- Update BUILD_JOB_PATTERN to match CUDA13 builds for trtllm
- Aligns with pattern used by other test jobs in workflow

Relates to: OPS-3217

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>

coderabbitai Bot commented Feb 10, 2026

Walkthrough

The CI test suite workflow configuration is updated to add CUDA 13 build dependency support. The fault-tolerance-tests job now depends on the CUDA 13 build job, and the build job pattern matching logic is adjusted to handle the CUDA 13 suffix when identifying corresponding build jobs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI Workflow Configuration: .github/workflows/ci-test-suite.yml | Added build-cuda13-amd64 to fault-tolerance-tests job dependencies and updated BUILD_JOB_PATTERN to include CUDA 13 suffix matching for build job identification. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

A rabbit hops through workflows neat,
With CUDA13 now complete! 🐰✨
Build jobs linked in perfect dance,
Tests flow forth with steady glance,
CI pipelines skip, hop, and prance!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding a build dependency for TRT-LLM's CUDA13 build to the fault tolerance tests. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The pull request description follows the required template structure with all key sections present: Overview (Summary), Details (Changes and Problem), and Related Issues (Closes OPS-3217). Content is clear and complete. |

The fault-tolerance tests were failing because they tried to use a
commit-SHA-based operator image that doesn't exist (nightly CI doesn't
build the operator).

Changes:
- Add 'Determine operator image tag' step to check for nightly operator image
- Falls back to 'main-operator' (stable image) if nightly image not found
- Update helm install to use the determined tag instead of hardcoded SHA tag
- Remove redundant docker pull attempt

This mirrors the pattern used in pr.yaml for operator image selection.

Relates to: OPS-3217

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
@nv-tusharma
Contributor Author

Update: Additional Fix for Operator Image

After the first run, I discovered a second issue - the fault-tolerance tests were trying to use an operator image that doesn't exist.

Root Cause

The nightly CI doesn't build the operator image, so the ${{ github.sha }}-operator-amd64 tag that the fault-tolerance tests requested was never published.

Solution (commit 5cf0867)

Added operator image fallback logic (similar to pr.yaml):

  • Try to use nightly-operator-amd64 if available
  • Fall back to main-operator (stable image) if not
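
The fallback can be sketched as a small shell function. This is an illustration under stated assumptions, not the step added in the commit: the names image_exists and determine_operator_tag are hypothetical, the repository argument is a placeholder, and the default probe uses docker manifest inspect only as one possible way to test whether a tag is published.

```shell
#!/usr/bin/env bash

# Hypothetical probe: succeeds if the image reference is published.
# Replace the body with whatever query suits the registry in use.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# Prints the operator tag to deploy: the nightly tag when it is published,
# otherwise the stable fallback tag named in the comment above.
determine_operator_tag() {
  local repo="$1"   # placeholder, e.g. <registry>/<operator-repository>
  local nightly_tag="nightly-operator-amd64"
  local fallback_tag="main-operator"

  if image_exists "${repo}:${nightly_tag}"; then
    printf '%s\n' "$nightly_tag"
  else
    printf '%s\n' "$fallback_tag"
  fi
}
```

In a workflow step, the printed tag would typically be appended to "$GITHUB_OUTPUT" so that the later helm install step can read it instead of a hardcoded SHA-based tag.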

Testing

Triggered new nightly CI run to validate both fixes:

  • Build dependency fix (first commit)
  • Operator image fallback (this commit)

Run: https://github.com/ai-dynamo/dynamo/actions/runs/21962861746

The operator image check was querying ECR but helm pulls from ACR,
causing a registry mismatch and deployment failures.

Changes:
- Update 'Determine operator image tag' to check Azure ACR
- This matches where helm actually pulls the operator image from

Follow-up: Extract operator deployment into reusable composite action
(see GitHub Actions architect recommendations)

Relates to: OPS-3217

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
@nv-tusharma
Contributor Author

Update: Fixed Registry Mismatch

Issue #3: Registry Mismatch

The operator image check was querying ECR (AWS) but helm was pulling from ACR (Azure), causing deployment failures.

Fix (commit 547ebbb)

  • Updated operator image check to query Azure ACR instead of ECR
  • Now checking the same registry where helm pulls images from
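
One way to keep the check and the deployment from drifting apart again is to derive both from a single registry value. The sketch below only illustrates that idea: the function names, the repository name "operator", and the helm value keys image.repository and image.tag are hypothetical and not taken from the chart or the workflow.

```shell
#!/usr/bin/env bash

# Hypothetical: builds the image reference that the existence check queries.
operator_image_ref() {
  local registry="$1" tag="$2"
  printf '%s/operator:%s\n' "$registry" "$tag"
}

# Hypothetical: prints --set arguments for helm install from the same inputs,
# so the check and the deployment cannot point at different registries.
helm_image_args() {
  local registry="$1" tag="$2"
  printf -- '--set image.repository=%s/operator --set image.tag=%s\n' "$registry" "$tag"
}
```

Both functions take the registry as a parameter, so switching from ECR to ACR is a single change at the call site rather than two separate edits.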

Testing

Triggered new nightly CI run. The run URL is https://github.com/ai-dynamo/dynamo/actions/runs/<run-id>, where <run-id> is the output of:

gh run list --workflow=nightly-ci.yml --branch tusharma/fix-nightly-ci-fault-tolerance-tests --limit 1 --json databaseId -q '.[0].databaseId'
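
The link to the latest nightly run on a branch can be assembled in a shell, as sketched below. The helper names run_url and latest_run_id are hypothetical; the gh run list invocation is the one quoted in this comment and needs an authenticated gh CLI.

```shell
#!/usr/bin/env bash

# Builds the Actions run URL for a numeric run ID.
run_url() {
  printf 'https://github.com/ai-dynamo/dynamo/actions/runs/%s\n' "$1"
}

# Prints the ID of the most recent nightly-ci.yml run on the given branch.
latest_run_id() {
  gh run list --workflow=nightly-ci.yml --branch "$1" --limit 1 \
    --json databaseId -q '.[0].databaseId'
}

# Example (not executed here):
#   run_url "$(latest_run_id tusharma/fix-nightly-ci-fault-tolerance-tests)"
```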

Follow-up

Created task #6 to refactor operator deployment into a reusable composite action (GitHub Actions best practice) once the tests are passing.

Tests were timing out at 60m. Increasing to 120m to allow
tests to complete successfully.
@github-actions
Contributor

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions Bot added the Stale label Mar 17, 2026
@github-actions
Contributor

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.

@github-actions github-actions Bot closed this Mar 23, 2026
@github-actions github-actions Bot deleted the tusharma/fix-nightly-ci-fault-tolerance-tests branch March 23, 2026 09:48