ci: Add nightly and release tests for gb200 #1788
Conversation
run config cli test first b/c other run_first tests depend on its correctness

Signed-off-by: Terry Kong <terryk@nvidia.com>
This reverts commit 8ca7edc. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
This reverts commit 74d69b9. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
This reverts commit d7648ff. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Force-push: 7ed45c1 to 5589a95
📝 Walkthrough

This PR introduces GB200 support by adding 4-GPU-per-node configurations. It includes documentation guidance for GB200 systems, creates 40+ new YAML recipe configs for LLM/VLM models, adds corresponding test shell scripts, establishes test suite manifests, and refactors config minimization tooling to automatically infer base configs from defaults chains.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/launch (1)
1-2: Update the NVIDIA header year to 2026.
This is a non-test shell script, and the guidelines require the current year in the header.

🛠️ Suggested fix

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
🤖 Fix all issues with AI agents
In `@examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml`:
- Line 1: The defaults reference currently omits the local prefix; update the
value for the YAML key "defaults" from "grpo-deepscaler-1.5b-8K.yaml" to
"./grpo-deepscaler-1.5b-8K.yaml" so it matches the repository convention of
using "./" for same-directory config references (edit the line that reads
defaults: grpo-deepscaler-1.5b-8K.yaml to include the ./ prefix).
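Sketched as a config fragment (only the `defaults` value comes from the comment; the before/after framing is illustrative):

```yaml
# Before: bare filename for a same-directory base
# defaults: grpo-deepscaler-1.5b-8K.yaml

# After: the "./" prefix marks the same-directory reference explicitly
defaults: ./grpo-deepscaler-1.5b-8K.yaml
```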
In
`@examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.yaml`:
- Around line 1-5: The defaults chain points to a megatrontp2 base (defaults:
./vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml) while this
recipe filename and checkpoint_dir use megatrontp1; update the defaults
reference to the corresponding megatrontp1 base (or alternatively rename this
file and checkpoint_dir to use megatrontp2) so the policy and TP strategy are
consistent; check the defaults value, the filename, and checkpoint_dir entries
to ensure all use the same megatrontpX identifier.
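One way to sketch the fix, assuming a matching tp1 base config exists alongside this recipe (the base filename below is hypothetical, not confirmed by the review):

```yaml
# In vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.yaml,
# point defaults at a base with the same megatrontpX identifier:
defaults: ./vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp1.v1.yaml
```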
In `@tests/test_suites/llm/grpo-deepscaler-1.5b-1n4g-8K.sh`:
- Around line 35-40: The jq expression used to compute the max step for the
if-condition can produce null/empty and break the arithmetic test; update the
condition that reads JSON_METRICS with the jq lookup to provide a safe default
(0) when missing (for example with jq's // 0 alternative operator, or by falling back with "|| echo 0" in the shell) so
the comparison [[ <value> -ge $MAX_STEPS ]] never receives an empty string;
change the invocation that currently uses jq 'to_entries | .[] | select(.key ==
"train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS to the
guarded form and leave the rest (JSON_METRICS, MAX_STEPS, and the uv run call)
unchanged.
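A runnable sketch of the guard, written with a simpler jq filter than the script's `to_entries` form; the metrics file and step count below are stand-ins, not values from the actual test:

```shell
# Simulate a metrics file where train/loss has no recorded steps yet.
JSON_METRICS=$(mktemp)
echo '{}' > "$JSON_METRICS"
MAX_STEPS=40

# The "// 0" fallback turns a missing/empty max into 0, so the
# arithmetic comparison below never receives an empty string.
max_step=$(jq '[.["train/loss"] // {} | keys[] | tonumber] | max // 0' "$JSON_METRICS")
if [[ "$max_step" -ge "$MAX_STEPS" ]]; then
  echo "target step reached"
else
  echo "only at step $max_step"
fi
rm -f "$JSON_METRICS"
```

Without the fallback, `max` of an empty object yields `null`/empty output and `[[ ... -ge ... ]]` errors out.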
In `@tests/test_suites/llm/grpo-gptoss-20b-8n4g-megatron.sh`:
- Around line 42-43: The unconditional deletion rm -rf "$CKPT_DIR" is dangerous
if CKPT_DIR is empty or mis-set; before that line in the script
(tests/test_suites/llm/grpo-gptoss-20b-8n4g-megatron.sh) add a sanity check that
ensures CKPT_DIR is non-empty, is an existing directory, and does not point to
root or an unexpected top-level path (e.g. "/" or "$HOME"); if you have a known
safe parent prefix (like a checkpoints base dir), also assert CKPT_DIR starts
with that prefix; only run rm -rf "$CKPT_DIR" when these checks pass and
otherwise exit with an error.
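A sketch of those checks; `CKPT_BASE` and the directory name are illustrative stand-ins, and the real script would get `CKPT_DIR` from its configuration:

```shell
# Illustrative safe parent prefix (assumption, not from the script).
CKPT_BASE="${TMPDIR:-/tmp}/nemo_rl_test_ckpts"
CKPT_DIR="$CKPT_BASE/grpo-gptoss-20b-8n4g"
mkdir -p "$CKPT_DIR"

# Refuse obviously dangerous values before deleting anything.
if [[ -z "$CKPT_DIR" || "$CKPT_DIR" == "/" || "$CKPT_DIR" == "$HOME" ]]; then
  echo "refusing to delete suspicious CKPT_DIR: '$CKPT_DIR'" >&2
  exit 1
fi
# Only delete paths under the known-safe parent.
if [[ "$CKPT_DIR" != "$CKPT_BASE"/* ]]; then
  echo "CKPT_DIR '$CKPT_DIR' is outside $CKPT_BASE; aborting" >&2
  exit 1
fi
if [[ -d "$CKPT_DIR" ]]; then
  rm -rf "$CKPT_DIR"
fi
echo "cleaned checkpoints under $CKPT_BASE"
```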
In `@tests/test_suites/llm/grpo-qwen2.5-7b-instruct-4n4g-megatron.sh`:
- Around line 37-41: The line in the test invocation of tests/check_metrics.py
containing the expression 'mean(data["train/reward"]) > 0.56' uses a tab for
indentation while the other metric lines use spaces; update that line to use the
same space-based indentation as the surrounding lines so the uv run command and
its multi-line arguments are consistently indented (locate the multi-line call
to tests/check_metrics.py and fix the whitespace before the
'mean(data["train/reward"]) > 0.56' argument).
In `@tests/test_suites/llm/sft-gpt-oss-20b-1n4g-fsdp4ep4-automodel.sh`:
- Around line 22-24: Update the WandB project configuration to use the
standardized test project name instead of a personal/debug one: replace the
value for logger.wandb.project (currently "ruit_personal_debug") with the
consistent project name used across tests (e.g., "nemo-rl") in the shell script
lines that set logger.wandb_enabled, logger.wandb.project and logger.wandb.name
so the tests log to the shared project.
In `@tests/test_suites/llm/sft-llama3.1-8b-1n4g-fsdp2tp1-long.sh`:
- Around line 17-31: The pipeline that runs "uv run examples/run_sft.py" is
masking the actual exit code because its output is piped to "tee $RUN_LOG";
enable pipefail (set -o pipefail) or after the pipeline read the first command's
exit status from PIPESTATUS[0] and exit with that value so a failing "uv run"
propagates non‑zero. Update the script around the "uv run ... 2>&1 | tee
$RUN_LOG" invocation to either enable pipefail before running or capture
PIPESTATUS[0] immediately after and call exit with it.
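Both options can be sketched as follows, with `false` standing in for the real `uv run examples/run_sft.py` invocation:

```shell
RUN_LOG=$(mktemp)

# Option 1: with pipefail enabled, the pipeline fails if any stage fails,
# not just when tee does (tee almost always succeeds).
set -o pipefail
if ! false 2>&1 | tee "$RUN_LOG"; then
  echo "pipeline failed under pipefail"
fi
set +o pipefail

# Option 2: read the first stage's status from PIPESTATUS immediately
# after the pipeline, before any other command overwrites it.
false 2>&1 | tee "$RUN_LOG"
status=${PIPESTATUS[0]}
echo "first stage exited with: $status"

rm -f "$RUN_LOG"
```

Note that `PIPESTATUS` is bash-specific; option 1 is the smaller diff for an existing script.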
In `@tests/test_suites/llm/sft-nanov3-30BA3B-2n4g-fsdp2-lora.sh`:
- Line 11: The comment for NUM_MINUTES is out of sync with its value; update
either the numeric value or the comment so they match — for example change the
comment to reflect 60 minutes ("NUM_MINUTES=60 # 60 minutes total (includes
buffer)") or set NUM_MINUTES to 18 if you intended "15 minutes + 3-minute
buffer"; modify the NUM_MINUTES assignment and its trailing comment accordingly
so they accurately describe the timeout.
In `@tests/test_suites/nightly_gb200.txt`:
- Around line 69-70: The manifest entry referencing
"distillation-qwen3-32b-to-1.7b-base-1n4g-fsdp2tp1.v1.yaml" has an incorrect
extension; change that filename to
"distillation-qwen3-32b-to-1.7b-base-1n4g-fsdp2tp1.v1.sh" so it matches the
other test script entries and the
test_all_test_scripts_accounted_for_in_test_suites glob for *.sh.
🧹 Nitpick comments (5)
tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n4g-fsdp2tp1-quick.v2.sh (1)

36-36: Harden the jq guard to avoid “integer expression expected.”

If the JSON is empty or missing `train/loss`, the `-ge` comparison can error. Consider defaulting to `-1` to keep the guard stable.

♻️ Proposed hardening

```diff
-if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
+if [[ $(jq -r 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max // -1' $JSON_METRICS) -ge $MAX_STEPS ]]; then
```

tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n4g-fsdp2tp1.v3.sh (1)
35-40: Bind metrics step key to `MAX_STEPS` instead of hard-coding 500.

The metrics gate uses `MAX_STEPS`, but the per-step assertion is pinned to `"500"`. If the config changes, the check and gate can diverge.

♻️ Suggested update

```diff
 # Only run metrics if the target step is reached
-if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
+TARGET_STEP="$MAX_STEPS"
+if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $TARGET_STEP ]]; then
   uv run tests/check_metrics.py $JSON_METRICS \
     'median(data["train/token_mult_prob_error"]) < 1.1' \
-    'data["train/token_mult_prob_error"]["500"] < 1.1' \
+    "data[\"train/token_mult_prob_error\"][\"$TARGET_STEP\"] < 1.1" \
     'mean(data["timing/train/total_step_time"], -6, -1) < 10'
```

tests/unit/tools/test_config_cli.py (1)
375-382: Consider using `capsys` for stdout capture.

The manual stdout capture pattern could be replaced with pytest's `capsys` fixture for consistency with other tests in this file (e.g., `test_minimize_in_place_and_check_with_explicit_base`).

♻️ Suggested refactor using capsys

```diff
-    # Re-read child to check what minimize would output
-    # (since in_place=False, we need to capture stdout)
-    import io
-    import sys
-
-    old_stdout = sys.stdout
-    sys.stdout = captured = io.StringIO()
-    cli.minimize(ns)
-    sys.stdout = old_stdout
-    minimized = captured.getvalue()
+    # Re-read child to check what minimize would output
+    cli.minimize(ns)
+    minimized = capsys.readouterr().out
```

Note: This would require adding `capsys: pytest.CaptureFixture[str]` to the function signature.

tests/unit/test_recipes_and_test_suites.py (1)
77-96: Consider extracting a helper function for fixture logic.

The test suite loading logic is duplicated across multiple fixtures (`nightly_test_suite`, `release_test_suite`, `nightly_gb200_test_suite`, `release_gb200_test_suite`). A helper function could reduce duplication.

♻️ Optional refactor

```python
def _load_test_suite(path: str) -> list[str]:
    """Load test suite entries from a manifest file, skipping comments and empty lines."""
    suite = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                suite.append(line)
    return suite


@pytest.fixture
def nightly_gb200_test_suite():
    return _load_test_suite(nightly_gb200_test_suite_path)


@pytest.fixture
def release_gb200_test_suite():
    return _load_test_suite(release_gb200_test_suite_path)
```

This pattern could also be applied to the existing `nightly_test_suite` and `release_test_suite` fixtures.

examples/configs/recipes/llm/grpo-deepscaler-1.5b-1n4g-8K.yaml (1)
1-9: Align the filename with the LLM recipe naming convention.

The current name omits the “strategy-and-params” segment required by the repo’s LLM recipe pattern. Please rename to include the strategy portion and update any references accordingly. As per coding guidelines, ...
What does this PR do?
Add nightly and release tests for gb200
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Documentation
Tests
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.