ci: Add nightly and release tests for gb200 #1788
Conversation
run config cli test first b/c other run_first tests depend on its correctness

Signed-off-by: Terry Kong <terryk@nvidia.com>
This reverts commit 8ca7edc. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
This reverts commit 74d69b9. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
This reverts commit d7648ff. Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Force-push: 7ed45c1 to 5589a95
📝 Walkthrough

This PR introduces GB200 support by adding 4-GPU-per-node configurations. It includes documentation guidance for GB200 systems, creates 40+ new YAML recipe configs for LLM/VLM models, adds corresponding test shell scripts, establishes test suite manifests, and refactors config minimization tooling to automatically infer base configs from defaults chains.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/launch (1)
1-2: Update the NVIDIA header year to 2026.
This is a non-test shell script, and the guidelines require the current year in the header.

🛠️ Suggested fix

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
🤖 Fix all issues with AI agents
In `@examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml`:
- Line 1: The defaults reference currently omits the local prefix; update the
value for the YAML key "defaults" from "grpo-deepscaler-1.5b-8K.yaml" to
"./grpo-deepscaler-1.5b-8K.yaml" so it matches the repository convention of
using "./" for same-directory config references (edit the line that reads
defaults: grpo-deepscaler-1.5b-8K.yaml to include the ./ prefix).
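Sketched as a config fragment (only the `defaults` value comes from the comment; the before/after framing is illustrative):

```yaml
# Before: bare filename for a same-directory base
# defaults: grpo-deepscaler-1.5b-8K.yaml

# After: the "./" prefix marks the same-directory reference explicitly
defaults: ./grpo-deepscaler-1.5b-8K.yaml
```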
In
`@examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.yaml`:
- Around line 1-5: The defaults chain points to a megatrontp2 base (defaults:
./vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml) while this
recipe filename and checkpoint_dir use megatrontp1; update the defaults
reference to the corresponding megatrontp1 base (or alternatively rename this
file and checkpoint_dir to use megatrontp2) so the policy and TP strategy are
consistent; check the defaults value, the filename, and checkpoint_dir entries
to ensure all use the same megatrontpX identifier.
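One way to sketch the fix, assuming a matching tp1 base config exists alongside this recipe (the base filename below is hypothetical, not confirmed by the review):

```yaml
# In vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n4g-megatrontp1.v1.yaml,
# point defaults at a base with the same megatrontpX identifier:
defaults: ./vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp1.v1.yaml
```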
In `@tests/test_suites/llm/grpo-deepscaler-1.5b-1n4g-8K.sh`:
- Around line 35-40: The jq expression used to compute the max step for the
if-condition can produce null/empty and break the arithmetic test; update the
condition that reads JSON_METRICS with the jq lookup to provide a safe default
(0) when missing (for example with jq's // 0 alternative operator, or by falling back with "|| echo 0" in the shell) so
the comparison [[ <value> -ge $MAX_STEPS ]] never receives an empty string;
change the invocation that currently uses jq 'to_entries | .[] | select(.key ==
"train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS to the
guarded form and leave the rest (JSON_METRICS, MAX_STEPS, and the uv run call)
unchanged.
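A runnable sketch of the guard, written with a simpler jq filter than the script's `to_entries` form; the metrics file and step count below are stand-ins, not values from the actual test:

```shell
# Simulate a metrics file where train/loss has no recorded steps yet.
JSON_METRICS=$(mktemp)
echo '{}' > "$JSON_METRICS"
MAX_STEPS=40

# The "// 0" fallback turns a missing/empty max into 0, so the
# arithmetic comparison below never receives an empty string.
max_step=$(jq '[.["train/loss"] // {} | keys[] | tonumber] | max // 0' "$JSON_METRICS")
if [[ "$max_step" -ge "$MAX_STEPS" ]]; then
  echo "target step reached"
else
  echo "only at step $max_step"
fi
rm -f "$JSON_METRICS"
```

Without the fallback, `max` of an empty object yields `null`/empty output and `[[ ... -ge ... ]]` errors out.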
In `@tests/test_suites/llm/grpo-gptoss-20b-8n4g-megatron.sh`:
- Around line 42-43: The unconditional deletion rm -rf "$CKPT_DIR" is dangerous
if CKPT_DIR is empty or mis-set; before that line in the script
(tests/test_suites/llm/grpo-gptoss-20b-8n4g-megatron.sh) add a sanity check that
ensures CKPT_DIR is non-empty, is an existing directory, and does not point to
root or an unexpected top-level path (e.g. "/" or "$HOME"); if you have a known
safe parent prefix (like a checkpoints base dir), also assert CKPT_DIR starts
with that prefix; only run rm -rf "$CKPT_DIR" when these checks pass and
otherwise exit with an error.
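A sketch of those checks; `CKPT_BASE` and the directory name are illustrative stand-ins, and the real script would get `CKPT_DIR` from its configuration:

```shell
# Illustrative safe parent prefix (assumption, not from the script).
CKPT_BASE="${TMPDIR:-/tmp}/nemo_rl_test_ckpts"
CKPT_DIR="$CKPT_BASE/grpo-gptoss-20b-8n4g"
mkdir -p "$CKPT_DIR"

# Refuse obviously dangerous values before deleting anything.
if [[ -z "$CKPT_DIR" || "$CKPT_DIR" == "/" || "$CKPT_DIR" == "$HOME" ]]; then
  echo "refusing to delete suspicious CKPT_DIR: '$CKPT_DIR'" >&2
  exit 1
fi
# Only delete paths under the known-safe parent.
if [[ "$CKPT_DIR" != "$CKPT_BASE"/* ]]; then
  echo "CKPT_DIR '$CKPT_DIR' is outside $CKPT_BASE; aborting" >&2
  exit 1
fi
if [[ -d "$CKPT_DIR" ]]; then
  rm -rf "$CKPT_DIR"
fi
echo "cleaned checkpoints under $CKPT_BASE"
```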
In `@tests/test_suites/llm/grpo-qwen2.5-7b-instruct-4n4g-megatron.sh`:
- Around line 37-41: The line in the test invocation of tests/check_metrics.py
containing the expression 'mean(data["train/reward"]) > 0.56' uses a tab for
indentation while the other metric lines use spaces; update that line to use the
same space-based indentation as the surrounding lines so the uv run command and
its multi-line arguments are consistently indented (locate the multi-line call
to tests/check_metrics.py and fix the whitespace before the
'mean(data["train/reward"]) > 0.56' argument).
In `@tests/test_suites/llm/sft-gpt-oss-20b-1n4g-fsdp4ep4-automodel.sh`:
- Around line 22-24: Update the WandB project configuration to use the
standardized test project name instead of a personal/debug one: replace the
value for logger.wandb.project (currently "ruit_personal_debug") with the
consistent project name used across tests (e.g., "nemo-rl") in the shell script
lines that set logger.wandb_enabled, logger.wandb.project and logger.wandb.name
so the tests log to the shared project.
In `@tests/test_suites/llm/sft-llama3.1-8b-1n4g-fsdp2tp1-long.sh`:
- Around line 17-31: The pipeline that runs "uv run examples/run_sft.py" is
masking the actual exit code because its output is piped to "tee $RUN_LOG";
enable pipefail (set -o pipefail) or after the pipeline read the first command's
exit status from PIPESTATUS[0] and exit with that value so a failing "uv run"
propagates non‑zero. Update the script around the "uv run ... 2>&1 | tee
$RUN_LOG" invocation to either enable pipefail before running or capture
PIPESTATUS[0] immediately after and call exit with it.
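Both options can be sketched as follows, with `false` standing in for the real `uv run examples/run_sft.py` invocation:

```shell
RUN_LOG=$(mktemp)

# Option 1: with pipefail enabled, the pipeline fails if any stage fails,
# not just when tee does (tee almost always succeeds).
set -o pipefail
if ! false 2>&1 | tee "$RUN_LOG"; then
  echo "pipeline failed under pipefail"
fi
set +o pipefail

# Option 2: read the first stage's status from PIPESTATUS immediately
# after the pipeline, before any other command overwrites it.
false 2>&1 | tee "$RUN_LOG"
status=${PIPESTATUS[0]}
echo "first stage exited with: $status"

rm -f "$RUN_LOG"
```

Note that `PIPESTATUS` is bash-specific; option 1 is the smaller diff for an existing script.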
In `@tests/test_suites/llm/sft-nanov3-30BA3B-2n4g-fsdp2-lora.sh`:
- Line 11: The comment for NUM_MINUTES is out of sync with its value; update
either the numeric value or the comment so they match — for example change the
comment to reflect 60 minutes ("NUM_MINUTES=60 # 60 minutes total (includes
buffer)") or set NUM_MINUTES to 18 if you intended "15 minutes + 3-minute
buffer"; modify the NUM_MINUTES assignment and its trailing comment accordingly
so they accurately describe the timeout.
In `@tests/test_suites/nightly_gb200.txt`:
- Around line 69-70: The manifest entry referencing
"distillation-qwen3-32b-to-1.7b-base-1n4g-fsdp2tp1.v1.yaml" has an incorrect
extension; change that filename to
"distillation-qwen3-32b-to-1.7b-base-1n4g-fsdp2tp1.v1.sh" so it matches the
other test script entries and the
test_all_test_scripts_accounted_for_in_test_suites glob for *.sh.
🧹 Nitpick comments (5)
tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n4g-fsdp2tp1-quick.v2.sh (1)

36-36: Harden the jq guard to avoid “integer expression expected.”

If the JSON is empty or missing `train/loss`, the `-ge` comparison can error. Consider defaulting to `-1` to keep the guard stable.

♻️ Proposed hardening

```diff
-if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
+if [[ $(jq -r 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max // -1' $JSON_METRICS) -ge $MAX_STEPS ]]; then
```

tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n4g-fsdp2tp1.v3.sh (1)
35-40: Bind metrics step key to `MAX_STEPS` instead of hard-coding 500.

The metrics gate uses `MAX_STEPS`, but the per-step assertion is pinned to `"500"`. If the config changes, the check and gate can diverge.

♻️ Suggested update

```diff
 # Only run metrics if the target step is reached
-if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $MAX_STEPS ]]; then
+TARGET_STEP="$MAX_STEPS"
+if [[ $(jq 'to_entries | .[] | select(.key == "train/loss") | .value | keys | map(tonumber) | max' $JSON_METRICS) -ge $TARGET_STEP ]]; then
   uv run tests/check_metrics.py $JSON_METRICS \
     'median(data["train/token_mult_prob_error"]) < 1.1' \
-    'data["train/token_mult_prob_error"]["500"] < 1.1' \
+    "data[\"train/token_mult_prob_error\"][\"$TARGET_STEP\"] < 1.1" \
     'mean(data["timing/train/total_step_time"], -6, -1) < 10'
```

tests/unit/tools/test_config_cli.py (1)
375-382: Consider using `capsys` for stdout capture.

The manual stdout capture pattern could be replaced with pytest's `capsys` fixture for consistency with other tests in this file (e.g., `test_minimize_in_place_and_check_with_explicit_base`).

♻️ Suggested refactor using capsys

```diff
-    # Re-read child to check what minimize would output
-    # (since in_place=False, we need to capture stdout)
-    import io
-    import sys
-
-    old_stdout = sys.stdout
-    sys.stdout = captured = io.StringIO()
-    cli.minimize(ns)
-    sys.stdout = old_stdout
-    minimized = captured.getvalue()
+    # Re-read child to check what minimize would output
+    cli.minimize(ns)
+    minimized = capsys.readouterr().out
```

Note: This would require adding `capsys: pytest.CaptureFixture[str]` to the function signature.

tests/unit/test_recipes_and_test_suites.py (1)
77-96: Consider extracting a helper function for fixture logic.

The test suite loading logic is duplicated across multiple fixtures (`nightly_test_suite`, `release_test_suite`, `nightly_gb200_test_suite`, `release_gb200_test_suite`). A helper function could reduce duplication.

♻️ Optional refactor

```python
def _load_test_suite(path: str) -> list[str]:
    """Load test suite entries from a manifest file, skipping comments and empty lines."""
    suite = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                suite.append(line)
    return suite


@pytest.fixture
def nightly_gb200_test_suite():
    return _load_test_suite(nightly_gb200_test_suite_path)


@pytest.fixture
def release_gb200_test_suite():
    return _load_test_suite(release_gb200_test_suite_path)
```

This pattern could also be applied to the existing `nightly_test_suite` and `release_test_suite` fixtures.

examples/configs/recipes/llm/grpo-deepscaler-1.5b-1n4g-8K.yaml (1)
1-9: Align the filename with the LLM recipe naming convention.

The current name omits the “strategy-and-params” segment required by the repo’s LLM recipe pattern. Please rename to include the strategy portion and update any references accordingly. As per coding guidelines, ...
What does this PR do?
Add nightly and release tests for gb200
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Documentation
Tests
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.