Skip to content

cp: fix: use median instead of mean for logprob error for stability in nightlies (1722) into r0.5.0#1731

Merged
yuki-97 merged 1 commit intor0.5.0from
cherry-pick-1722-r0.5.0
Jan 7, 2026
Merged

cp: fix: use median instead of mean for logprob error for stability in nightlies (1722) into r0.5.0#1731
yuki-97 merged 1 commit intor0.5.0from
cherry-pick-1722-r0.5.0

Conversation

@chtruong814
Copy link
Copy Markdown
Contributor

@chtruong814 chtruong814 commented Jan 7, 2026

beep boop [🤖]: Hi @terrykong 👋,

we've cherry picked #1722 into  for you! 🚀

Please review and approve this cherry pick by your convenience!

Summary by CodeRabbit

  • New Features

    • Added median-based statistical metric validation for performance evaluations, enabling more robust threshold checks.
  • Tests

    • Updated metric validation checks to use median instead of mean for improved statistical robustness.
    • Enhanced test suite with automated checkpoint cleanup after successful runs to optimize disk space usage.
    • Consolidated experiment tracking configuration across distributed test scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

…ghtlies (#1722)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 7, 2026

📝 Walkthrough

Walkthrough

This pull request adds a median() statistical aggregation function to the metrics evaluation framework and applies it across numerous test scripts. Additionally, checkpoint cleanup logic is systematically added to test suites, and WandB project names are standardized in distillation tests.

Changes

Cohort / File(s) Summary
Core metrics function
tests/check_metrics.py
Added median(value, range_start, range_end) function that computes median of a subset with 1-indexed range handling, mirroring existing mean() logic. Exposed in evaluation context as "median": median for use in metric checks.
Unit tests
tests/unit/test_check_metrics.py
Updated imports to include newly exported median function alongside existing evaluate_check, mean, min, max, ratio_above.
GRPO/DPO LLM tests
tests/test_suites/llm/dapo-qwen2.5-7b.sh, tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2.sh, tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh, tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.sh, tests/test_suites/llm/dpo-llama3.1-8b-tulu3-1n8g-fsdp2tp1.sh, tests/test_suites/llm/dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.sh, tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh
Added rm -rf "$CKPT_DIR" cleanup after successful metrics checks. Some scripts also adjust metric thresholds (e.g., train/loss bounds). DPO-Qwen2.5-7B additionally replaces mean with median for token_mult_prob_error check.
GRPO core tests
tests/test_suites/llm/grpo-dapomath17k-dsv3-megatron.sh, tests/test_suites/llm/grpo-deepscaler-1.5b-*.sh, tests/test_suites/llm/grpo-gemma3-*.sh, tests/test_suites/llm/grpo-gptoss-20b-8n8g-megatron.sh, tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh, tests/test_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled, tests/test_suites/llm/grpo-llama3.1-8b-instruct-*.sh, tests/test_suites/llm/grpo-llama3.2-1b-instruct-*.sh, tests/test_suites/llm/grpo-math-*.sh, tests/test_suites/llm/grpo-moonlight-16ba3b-*.sh, tests/test_suites/llm/grpo-nano-v2-12b-*.sh, tests/test_suites/llm/grpo-qwen2.5-*.sh, tests/test_suites/llm/grpo-qwen3-*.sh
Replaced mean(data["train/token_mult_prob_error"]) with median(data["train/token_mult_prob_error"]) in metric validation. Added rm -rf "$CKPT_DIR" cleanup within success branch of metrics checks.
GRPO performance tests
tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh, tests/test_suites/llm/performance/grpo-deepseek-v3-*.sh, tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-*.sh, tests/test_suites/llm/performance/grpo-qwen3-*.sh, tests/test_suites/llm/performance/grpo-qwen3-32b-*.sh
Replaced mean(data["train/token_mult_prob_error"]) with median(data["train/token_mult_prob_error"]). Added checkpoint cleanup with rm -rf "$CKPT_DIR" after successful metric validation.
SFT tests
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh, tests/test_suites/llm/sft-llama3.1-*.sh, tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh, tests/test_suites/llm/sft-nanov3-30BA3B-*.sh, tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh, tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh
Added rm -rf "$CKPT_DIR" cleanup after successful metrics checks within conditional blocks.
VLM tests
tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-*.sh
Added checkpoint cleanup with rm -rf "$CKPT_DIR" after successful metrics validation.
Distillation tests
tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh, tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-*.sh, tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.sh
Updated WandB project name from nemo-rl-distillation to nemo-rl. Added rm -rf "$CKPT_DIR" cleanup after successful runs.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

The core median function addition is straightforward (simple utility mirroring existing mean logic). The bulk of changes consist of highly homogeneous, repetitive pattern applications across test scripts—identical cleanup and metric replacement operations replicated across many files, which requires minimal per-file review effort despite the large number of affected files.

Possibly related PRs

Suggested labels

CI:L1, r0.5.0

Suggested reviewers

  • terrykong
  • yfw

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Test Results For Major Changes ⚠️ Warning PR lacks quantitative evidence and testing documentation for major numeric changes to 50+ test files replacing mean with median aggregation. Add before-and-after metrics, variance comparisons, nightly run evidence, statistical rationale, and verify no regression in affected test suites.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: replacing mean with median for logprob error calculations to improve stability in nightly tests.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In
@tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.sh:
- Around line 42-44: Before running rm -rf on CKPT_DIR, add defensive
validation: ensure the CKPT_DIR variable is set/non-empty (check -n
"$CKPT_DIR"), ensure it is not "/" or empty string (explicitly guard against "/"
and maybe "."), and ensure it exists and is a directory (check -d "$CKPT_DIR");
if any check fails, log an error and skip deletion or exit non-zero; only then
run rm -rf "$CKPT_DIR". Use the CKPT_DIR variable and the rm -rf invocation in
the script to locate where to add these guards.
🟠 Major comments (21)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh-40-41 (1)

40-41: Add safety check before destructive removal.

While the variable is properly quoted, it's best practice to verify that $CKPT_DIR is set and non-empty before executing rm -rf to prevent unintended deletions if the variable is undefined or empty.

🔎 Proposed safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.sh-39-41 (1)

39-41: Add error checking before cleanup to prevent deleting checkpoints when tests fail.

The cleanup executes regardless of whether check_metrics.py succeeds. Without set -e or an explicit exit-status check, failed metric validations will still trigger checkpoint deletion, destroying valuable debugging artifacts.

Additionally, add a safety guard to verify CKPT_DIR is set and non-empty before executing rm -rf.

🔎 Proposed fix with error checking and safety guards
+    # Only clean up if metrics validation passed
+    if [ $? -eq 0 ] && [ -n "$CKPT_DIR" ]; then
+        # Clean up checkpoint directory after successful run to save space.
+        rm -rf "$CKPT_DIR"
+    fi
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"

Or, for clearer intent, restructure to check the metrics first:

-    uv run tests/check_metrics.py $JSON_METRICS \
-        'data["train/loss"]["1"] < 0.6' \
-        'data["train/loss"]["250"] < 0.36' \
-        'mean(data["timing/train/total_step_time"], 2) < 6'
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if uv run tests/check_metrics.py $JSON_METRICS \
+        'data["train/loss"]["1"] < 0.6' \
+        'data["train/loss"]["250"] < 0.36' \
+        'mean(data["timing/train/total_step_time"], 2) < 6'; then
+        # Clean up checkpoint directory after successful run to save space.
+        if [ -n "$CKPT_DIR" ]; then
+            rm -rf "$CKPT_DIR"
+        fi
+    fi
tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh-63-64 (1)

63-64: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.

🔎 Proposed fix with safety checks
 
 # Clean up checkpoint directory after successful run to save space.
-rm -rf "$CKPT_DIR"
+if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+    rm -rf "$CKPT_DIR"
+fi
 
tests/test_suites/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8-actckpt-long.v3.sh-40-41 (1)

40-41: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.

🔎 Proposed fix with safety checks
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh-40-41 (1)

40-41: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.

🔎 Proposed fix with safety checks
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.sh-40-41 (1)

40-41: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.

🔎 Proposed fix with safety checks
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron.sh-40-41 (1)

40-41: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.

🔎 Proposed fix with safety checks
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh-40-41 (1)

40-41: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.

🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.sh-41-42 (1)

41-42: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.

🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/dpo-llama3.1-8b-tulu3-1n8g-fsdp2tp1.sh-43-44 (1)

43-44: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.

🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh-42-43 (1)

42-43: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.

🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron.sh-41-42 (1)

41-42: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.

🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh-40-41 (1)

40-41: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command lacks validation that $CKPT_DIR is set and non-empty, which could lead to unexpected behavior if the variable is unset or empty.

🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"

Or use parameter expansion:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8-actckpt.v3.sh-40-41 (1)

40-41: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command should include validation that $CKPT_DIR is set and non-empty to prevent unexpected behavior.

🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"

Or use parameter expansion:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-long.sh-43-44 (1)

43-44: Add safety check before rm -rf to prevent unintended deletion.

The cleanup step is a good practice to save space after successful runs. However, rm -rf "$CKPT_DIR" should validate that $CKPT_DIR is set and non-empty before execution.

🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"

Or use parameter expansion:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh-41-42 (1)

41-42: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command should validate that $CKPT_DIR is set and non-empty before execution to prevent unexpected behavior.

🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"

Or use parameter expansion:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh-40-41 (1)

40-41: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command is executed without verifying that $CKPT_DIR is set and non-empty. If the variable is unset or empty, this could lead to unexpected behavior.

🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"

Alternatively, use parameter expansion to ensure the variable is set:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh-46-47 (1)

46-47: Add safety guard before rm -rf.

Protect against accidental deletion by validating $CKPT_DIR before the rm -rf command:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh-46-47 (1)

46-47: Add safety guard before rm -rf.

Add validation to ensure safe deletion:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-qwen3-30ba3b-8n8g-megatron.sh-41-42 (1)

41-42: Add safety guard before rm -rf.

The rm -rf "$CKPT_DIR" command poses the same safety risk as noted in the other test scripts. Please add a validation check to ensure $CKPT_DIR is non-empty and exists before deletion:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh-40-41 (1)

40-41: Add safety guard before rm -rf.

Add a safety check to prevent accidental deletion if $CKPT_DIR is unset or empty:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
🧹 Nitpick comments (13)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)

38-40: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset or set incorrectly. Consider adding a safety check to verify the variable is non-empty and points to a valid checkpoint directory before removal.

🔎 Proposed safety improvement
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"

Additionally, the PR objectives mention "use median instead of mean for logprob error," but this file only shows checkpoint cleanup changes. Line 37 still uses mean() for timing metrics. Can you clarify if this file was intended to include median-related changes, or if the cleanup is the only intended change for this particular test suite?

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh (1)

40-41: Consider adding a safety check before removing the checkpoint directory.

While the cleanup is a good practice to save space, adding a guard to ensure $CKPT_DIR is not empty would prevent accidental deletion if the variable is unset or misconfigured.

🔎 Proposed safety enhancement
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
 fi
tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron.sh (1)

42-43: LGTM! Cleanup appropriately preserves failed run artifacts.

The checkpoint cleanup after successful metrics validation is good for managing disk space in nightly runs. Since the cleanup only executes when metrics pass (inside the if block), failed runs retain their checkpoints for debugging, which is the correct behavior.

Optional: Add defensive check before cleanup

If you want to be extra defensive, you could verify CKPT_DIR is set and non-empty before removal:

     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" ]] && [[ -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi

However, this is likely unnecessary given the test infrastructure should ensure proper configuration.

tests/test_suites/llm/grpo-math-llama-nemotron-super-49b-v.5-4n8g-fsdp2tp8.sh.disabled (1)

35-41: Consider adding cleanup step for consistency.

Unlike other test scripts in this PR that add a checkpoint cleanup step after successful metrics validation, this file doesn't include the cleanup. While this file is disabled (.disabled extension), consider adding the cleanup step for consistency:

    uv run tests/check_metrics.py $JSON_METRICS \
        'median(data["train/token_mult_prob_error"]) < 1.1' \
        'data["train/token_mult_prob_error"]["2"] < 1.1' \
        'mean(data["timing/train/policy_training"]) < 280' \
        'mean(data["ray/node.0.gpu.0.mem_gb"]) < 75'
+
+    # Clean up checkpoint directory after successful run to save space.
+    rm -rf "$CKPT_DIR"
fi
tests/test_suites/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.sh (1)

40-41: Add safety check before rm -rf to prevent unintended deletion.

The cleanup step lacks validation of $CKPT_DIR before deletion. If the variable is unset, empty, or malformed, rm -rf could delete unintended paths.

🔎 Proposed safety guard
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh (1)

44-45: Add safety check before rm -rf to prevent unintended deletion.

The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.

🔎 Proposed safety guard
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh (1)

41-42: Add safety check before rm -rf to prevent unintended deletion.

The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.

🔎 Proposed safety guard
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh (1)

40-41: Add safety check before rm -rf to prevent unintended deletion.

The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.

🔎 Proposed safety guard
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh (1)

69-70: Add safety check before rm -rf to prevent unintended deletion.

The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.

🔎 Proposed safety guard
 
 # Clean up checkpoint directory after successful run to save space.
+if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-rm -rf "$CKPT_DIR"
+    rm -rf "$CKPT_DIR"
+fi
tests/test_suites/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.sh (1)

41-42: Consider adding safety check before removing checkpoint directory.

While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.

🔎 Suggested safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-qwen2.5-7b-instruct-4n8g-fsdp2tp4.v3.sh (1)

42-43: Consider adding safety check before removing checkpoint directory.

While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.

🔎 Suggested safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.sh (1)

40-41: Consider adding safety check before removing checkpoint directory.

While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.

🔎 Suggested safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh (1)

40-41: Consider adding safety check before removing checkpoint directory.

While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.

🔎 Suggested safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi

Comment on lines +42 to +44

# Clean up checkpoint directory after successful run to save space.
rm -rf "$CKPT_DIR"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add validation before rm -rf to prevent accidental data loss.

The rm -rf "$CKPT_DIR" command is dangerous without validation. If $CKPT_DIR is unset, empty, or accidentally points to a system directory, this could cause catastrophic data loss.

🔎 Proposed fix with defensive checks
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    else
+        echo "Warning: CKPT_DIR is not set or invalid, skipping cleanup"
+    fi

This ensures:

  • $CKPT_DIR is not empty (-n)
  • $CKPT_DIR is not the root directory
  • $CKPT_DIR exists as a directory (-d)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Clean up checkpoint directory after successful run to save space.
rm -rf "$CKPT_DIR"
# Clean up checkpoint directory after successful run to save space.
if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" && -d "$CKPT_DIR" ]]; then
rm -rf "$CKPT_DIR"
else
echo "Warning: CKPT_DIR is not set or invalid, skipping cleanup"
fi
🤖 Prompt for AI Agents
In
@tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.sh
around lines 42 - 44, Before running rm -rf on CKPT_DIR, add defensive
validation: ensure the CKPT_DIR variable is set/non-empty (check -n
"$CKPT_DIR"), ensure it is not "/" or empty string (explicitly guard against "/"
and maybe "."), and ensure it exists and is a directory (check -d "$CKPT_DIR");
if any check fails, log an error and skip deletion or exit non-zero; only then
run rm -rf "$CKPT_DIR". Use the CKPT_DIR variable and the rm -rf invocation in
the script to locate where to add these guards.

@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jan 7, 2026
@yuki-97 yuki-97 enabled auto-merge (squash) January 7, 2026 05:09
@yuki-97 yuki-97 merged commit 77993ea into r0.5.0 Jan 7, 2026
64 of 71 checks passed
@yuki-97 yuki-97 deleted the cherry-pick-1722-r0.5.0 branch January 7, 2026 11:42
avenkateshha pushed a commit to avenkateshha/RL that referenced this pull request Apr 10, 2026
…in nightlies (1722)` into `r0.5.0` (NVIDIA-NeMo#1731)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick CI:L1 Run doctests, unit tests, and functional tests Run CICD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants