cp: `fix: use median instead of mean for logprob error for stability in nightlies (1722)` into `r0.5.0` by chtruong814 · Pull Request #1731 · NVIDIA-NeMo/RL

chtruong814 · 2026-01-07T03:33:11Z

beep boop [🤖]: Hi @terrykong 👋,

we've cherry picked #1722 into  for you! 🚀

Please review and approve this cherry pick by your convenience!

Summary by CodeRabbit

New Features
- Added median-based statistical metric validation for performance evaluations, enabling more robust threshold checks.
Tests
- Updated metric validation checks to use median instead of mean for improved statistical robustness.
- Enhanced test suite with automated checkpoint cleanup after successful runs to optimize disk space usage.
- Consolidated experiment tracking configuration across distributed test scenarios.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…ghtlies (#1722) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

coderabbitai · 2026-01-07T03:37:13Z

📝 Walkthrough

Walkthrough

This pull request adds a median() statistical aggregation function to the metrics evaluation framework and applies it across numerous test scripts. Additionally, checkpoint cleanup logic is systematically added to test suites, and WandB project names are standardized in distillation tests.

Changes

Cohort / File(s)	Summary
Core metrics function `tests/check_metrics.py`	Added `median(value, range_start, range_end)` function that computes median of a subset with 1-indexed range handling, mirroring existing `mean()` logic. Exposed in evaluation context as `"median": median` for use in metric checks.
Unit tests `tests/unit/test_check_metrics.py`	Updated imports to include newly exported `median` function alongside existing `evaluate_check`, `mean`, `min`, `max`, `ratio_above`.
GRPO/DPO LLM tests `tests/test_suites/llm/dapo-qwen2.5-7b.sh`, `tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2.sh`, `tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh`, `tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.sh`, `tests/test_suites/llm/dpo-llama3.1-8b-tulu3-1n8g-fsdp2tp1.sh`, `tests/test_suites/llm/dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.sh`, `tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh`	Added `rm -rf "$CKPT_DIR"` cleanup after successful metrics checks. Some scripts also adjust metric thresholds (e.g., train/loss bounds). DPO-Qwen2.5-7B additionally replaces mean with median for token_mult_prob_error check.
GRPO core tests `tests/test_suites/llm/grpo-dapomath17k-dsv3-megatron.sh`, `tests/test_suites/llm/grpo-deepscaler-1.5b-.sh`, `tests/test_suites/llm/grpo-gemma3-.sh`, `tests/test_suites/llm/grpo-gptoss-20b-8n8g-megatron.sh`, `tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh`, `tests/test_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled`, `tests/test_suites/llm/grpo-llama3.1-8b-instruct-.sh`, `tests/test_suites/llm/grpo-llama3.2-1b-instruct-.sh`, `tests/test_suites/llm/grpo-math-.sh`, `tests/test_suites/llm/grpo-moonlight-16ba3b-.sh`, `tests/test_suites/llm/grpo-nano-v2-12b-.sh`, `tests/test_suites/llm/grpo-qwen2.5-.sh`, `tests/test_suites/llm/grpo-qwen3-*.sh`	Replaced `mean(data["train/token_mult_prob_error"])` with `median(data["train/token_mult_prob_error"])` in metric validation. Added `rm -rf "$CKPT_DIR"` cleanup within success branch of metrics checks.
GRPO performance tests `tests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh`, `tests/test_suites/llm/performance/grpo-deepseek-v3-.sh`, `tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-.sh`, `tests/test_suites/llm/performance/grpo-qwen3-.sh`, `tests/test_suites/llm/performance/grpo-qwen3-32b-.sh`	Replaced `mean(data["train/token_mult_prob_error"])` with `median(data["train/token_mult_prob_error"])`. Added checkpoint cleanup with `rm -rf "$CKPT_DIR"` after successful metric validation.
SFT tests `tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh`, `tests/test_suites/llm/sft-llama3.1-.sh`, `tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh`, `tests/test_suites/llm/sft-nanov3-30BA3B-.sh`, `tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh`, `tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh`	Added `rm -rf "$CKPT_DIR"` cleanup after successful metrics checks within conditional blocks.
VLM tests `tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-*.sh`	Added checkpoint cleanup with `rm -rf "$CKPT_DIR"` after successful metrics validation.
Distillation tests `tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh`, `tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-*.sh`, `tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.sh`	Updated WandB project name from `nemo-rl-distillation` to `nemo-rl`. Added `rm -rf "$CKPT_DIR"` cleanup after successful runs.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

The core median function addition is straightforward (simple utility mirroring existing mean logic). The bulk of changes consist of highly homogeneous, repetitive pattern applications across test scripts—identical cleanup and metric replacement operations replicated across many files, which requires minimal per-file review effort despite the large number of affected files.

Possibly related PRs

fix: use median instead of mean for logprob error for stability in nightlies #1722: Implements the core median function and initial integration into evaluate_check; this PR extends that functionality across all test suites with systematic application of median-based checks and cleanup logic.

Suggested labels

CI:L1, r0.5.0

Suggested reviewers

terrykong
yfw

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Test Results For Major Changes	⚠️ Warning	PR lacks quantitative evidence and testing documentation for major numeric changes to 50+ test files replacing mean with median aggregation.	Add before-and-after metrics, variance comparisons, nightly run evidence, statistical rationale, and verify no regression in affected test suites.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: replacing mean with median for logprob error calculations to improve stability in nightly tests.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents

In
@tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.sh:
- Around line 42-44: Before running rm -rf on CKPT_DIR, add defensive
validation: ensure the CKPT_DIR variable is set/non-empty (check -n
"$CKPT_DIR"), ensure it is not "/" or empty string (explicitly guard against "/"
and maybe "."), and ensure it exists and is a directory (check -d "$CKPT_DIR");
if any check fails, log an error and skip deletion or exit non-zero; only then
run rm -rf "$CKPT_DIR". Use the CKPT_DIR variable and the rm -rf invocation in
the script to locate where to add these guards.

🟠 Major comments (21)

tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh-40-41 (1)
40-41: Add safety check before destructive removal.

While the variable is properly quoted, it's best practice to verify that $CKPT_DIR is set and non-empty before executing rm -rf to prevent unintended deletions if the variable is undefined or empty.
🔎 Proposed safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.sh-39-41 (1)
39-41: Add error checking before cleanup to prevent deleting checkpoints when tests fail.

The cleanup executes regardless of whether check_metrics.py succeeds. Without set -e or an explicit exit-status check, failed metric validations will still trigger checkpoint deletion, destroying valuable debugging artifacts.

Additionally, add a safety guard to verify CKPT_DIR is set and non-empty before executing rm -rf.
🔎 Proposed fix with error checking and safety guards
+    # Only clean up if metrics validation passed
+    if [ $? -eq 0 ] && [ -n "$CKPT_DIR" ]; then
+        # Clean up checkpoint directory after successful run to save space.
+        rm -rf "$CKPT_DIR"
+    fi
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
Or, for clearer intent, restructure to check the metrics first:
-    uv run tests/check_metrics.py $JSON_METRICS \
-        'data["train/loss"]["1"] < 0.6' \
-        'data["train/loss"]["250"] < 0.36' \
-        'mean(data["timing/train/total_step_time"], 2) < 6'
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if uv run tests/check_metrics.py $JSON_METRICS \
+        'data["train/loss"]["1"] < 0.6' \
+        'data["train/loss"]["250"] < 0.36' \
+        'mean(data["timing/train/total_step_time"], 2) < 6'; then
+        # Clean up checkpoint directory after successful run to save space.
+        if [ -n "$CKPT_DIR" ]; then
+            rm -rf "$CKPT_DIR"
+        fi
+    fi
tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh-63-64 (1)
63-64: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
 
 # Clean up checkpoint directory after successful run to save space.
-rm -rf "$CKPT_DIR"
+if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+    rm -rf "$CKPT_DIR"
+fi
 
tests/test_suites/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8-actckpt-long.v3.sh-40-41 (1)
40-41: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh-40-41 (1)
40-41: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.sh-40-41 (1)
40-41: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron.sh-40-41 (1)
40-41: Add safety checks before destructive rm -rf operation.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh-40-41 (1)
40-41: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.sh-41-42 (1)
41-42: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/dpo-llama3.1-8b-tulu3-1n8g-fsdp2tp1.sh-43-44 (1)
43-44: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh-42-43 (1)
42-43: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron.sh-41-42 (1)
41-42: Add safety validation before destructive cleanup.

The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh-40-41 (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command lacks validation that $CKPT_DIR is set and non-empty, which could lead to unexpected behavior if the variable is unset or empty.
🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Or use parameter expansion:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8-actckpt.v3.sh-40-41 (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command should include validation that $CKPT_DIR is set and non-empty to prevent unexpected behavior.
🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Or use parameter expansion:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-long.sh-43-44 (1)
43-44: Add safety check before rm -rf to prevent unintended deletion.

The cleanup step is a good practice to save space after successful runs. However, rm -rf "$CKPT_DIR" should validate that $CKPT_DIR is set and non-empty before execution.
🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Or use parameter expansion:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh-41-42 (1)
41-42: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command should validate that $CKPT_DIR is set and non-empty before execution to prevent unexpected behavior.
🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Or use parameter expansion:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh-40-41 (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.

The rm -rf "$CKPT_DIR" command is executed without verifying that $CKPT_DIR is set and non-empty. If the variable is unset or empty, this could lead to unexpected behavior.
🔎 Recommended fix with safety check
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Alternatively, use parameter expansion to ensure the variable is set:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh-46-47 (1)
46-47: Add safety guard before rm -rf.

Protect against accidental deletion by validating $CKPT_DIR before the rm -rf command:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh-46-47 (1)
46-47: Add safety guard before rm -rf.

Add validation to ensure safe deletion:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-qwen3-30ba3b-8n8g-megatron.sh-41-42 (1)
41-42: Add safety guard before rm -rf.

The rm -rf "$CKPT_DIR" command poses the same safety risk as noted in the other test scripts. Please add a validation check to ensure $CKPT_DIR is non-empty and exists before deletion:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh-40-41 (1)
40-41: Add safety guard before rm -rf.

Add a safety check to prevent accidental deletion if $CKPT_DIR is unset or empty:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+        rm -rf "$CKPT_DIR"
+    fi

🧹 Nitpick comments (13)

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)
38-40: Add safety check before removing checkpoint directory.

The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset or set incorrectly. Consider adding a safety check to verify the variable is non-empty and points to a valid checkpoint directory before removal.
🔎 Proposed safety improvement
+    # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
-
-    # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
Additionally, the PR objectives mention "use median instead of mean for logprob error," but this file only shows checkpoint cleanup changes. Line 37 still uses mean() for timing metrics. Can you clarify if this file was intended to include median-related changes, or if the cleanup is the only intended change for this particular test suite?
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh (1)
40-41: Consider adding a safety check before removing the checkpoint directory.

While the cleanup is a good practice to save space, adding a guard to ensure $CKPT_DIR is not empty would prevent accidental deletion if the variable is unset or misconfigured.
🔎 Proposed safety enhancement
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
 fi
tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron.sh (1)
42-43: LGTM! Cleanup appropriately preserves failed run artifacts.

The checkpoint cleanup after successful metrics validation is good for managing disk space in nightly runs. Since the cleanup only executes when metrics pass (inside the if block), failed runs retain their checkpoints for debugging, which is the correct behavior.
Optional: Add defensive check before cleanup

If you want to be extra defensive, you could verify CKPT_DIR is set and non-empty before removal:
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" ]] && [[ -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
However, this is likely unnecessary given the test infrastructure should ensure proper configuration.
tests/test_suites/llm/grpo-math-llama-nemotron-super-49b-v.5-4n8g-fsdp2tp8.sh.disabled (1)
35-41: Consider adding cleanup step for consistency.

Unlike other test scripts in this PR that add a checkpoint cleanup step after successful metrics validation, this file doesn't include the cleanup. While this file is disabled (.disabled extension), consider adding the cleanup step for consistency:
    uv run tests/check_metrics.py $JSON_METRICS \
        'median(data["train/token_mult_prob_error"]) < 1.1' \
        'data["train/token_mult_prob_error"]["2"] < 1.1' \
        'mean(data["timing/train/policy_training"]) < 280' \
        'mean(data["ray/node.0.gpu.0.mem_gb"]) < 75'
+
+    # Clean up checkpoint directory after successful run to save space.
+    rm -rf "$CKPT_DIR"
fi
tests/test_suites/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.sh (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.

The cleanup step lacks validation of $CKPT_DIR before deletion. If the variable is unset, empty, or malformed, rm -rf could delete unintended paths.
🔎 Proposed safety guard
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh (1)
44-45: Add safety check before rm -rf to prevent unintended deletion.

The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.
🔎 Proposed safety guard
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh (1)
41-42: Add safety check before rm -rf to prevent unintended deletion.

The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.
🔎 Proposed safety guard
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.

The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.
🔎 Proposed safety guard
 
     # Clean up checkpoint directory after successful run to save space.
+    if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-    rm -rf "$CKPT_DIR"
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh (1)
69-70: Add safety check before rm -rf to prevent unintended deletion.

The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.
🔎 Proposed safety guard
 
 # Clean up checkpoint directory after successful run to save space.
+if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-rm -rf "$CKPT_DIR"
+    rm -rf "$CKPT_DIR"
+fi
tests/test_suites/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.sh (1)
41-42: Consider adding safety check before removing checkpoint directory.

While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.
🔎 Suggested safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-qwen2.5-7b-instruct-4n8g-fsdp2tp4.v3.sh (1)
42-43: Consider adding safety check before removing checkpoint directory.

While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.
🔎 Suggested safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.sh (1)
40-41: Consider adding safety check before removing checkpoint directory.

While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.
🔎 Suggested safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh (1)
40-41: Consider adding safety check before removing checkpoint directory.

While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.
🔎 Suggested safety check
 
     # Clean up checkpoint directory after successful run to save space.
-    rm -rf "$CKPT_DIR"
+    if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+        rm -rf "$CKPT_DIR"
+    fi

coderabbitai · 2026-01-07T03:37:16Z

+
+    # Clean up checkpoint directory after successful run to save space.
+    rm -rf "$CKPT_DIR"


⚠️ Potential issue | 🔴 Critical

Add validation before rm -rf to prevent accidental data loss.

The rm -rf "$CKPT_DIR" command is dangerous without validation. If $CKPT_DIR is unset, empty, or accidentally points to a system directory, this could cause catastrophic data loss.

🔎 Proposed fix with defensive checks

# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" && -d "$CKPT_DIR" ]]; then + rm -rf "$CKPT_DIR" + else + echo "Warning: CKPT_DIR is not set or invalid, skipping cleanup" + fi

This ensures:

$CKPT_DIR is not empty (-n)

$CKPT_DIR is not the root directory

$CKPT_DIR exists as a directory (-d)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Clean up checkpoint directory after successful run to save space.

rm -rf "$CKPT_DIR"

# Clean up checkpoint directory after successful run to save space.

if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" && -d "$CKPT_DIR" ]]; then

rm -rf "$CKPT_DIR"

else

echo "Warning: CKPT_DIR is not set or invalid, skipping cleanup"

fi

🤖 Prompt for AI Agents

In @tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.sh around lines 42 - 44, Before running rm -rf on CKPT_DIR, add defensive validation: ensure the CKPT_DIR variable is set/non-empty (check -n "$CKPT_DIR"), ensure it is not "/" or empty string (explicitly guard against "/" and maybe "."), and ensure it exists and is a directory (check -d "$CKPT_DIR"); if any check fails, log an error and skip deletion or exit non-zero; only then run rm -rf "$CKPT_DIR". Use the CKPT_DIR variable and the rm -rf invocation in the script to locate where to add these guards.

…in nightlies (1722)` into `r0.5.0` (NVIDIA-NeMo#1731) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

fix: use median instead of mean for logprob error for stability in ni…

7c7c9f8

…ghtlies (#1722) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

chtruong814 requested a review from a team as a code owner January 7, 2026 03:33

chtruong814 requested a review from terrykong January 7, 2026 03:33

chtruong814 added cherry-pick Run CICD labels Jan 7, 2026

coderabbitai Bot reviewed Jan 7, 2026

View reviewed changes

yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jan 7, 2026

yuki-97 temporarily deployed to nemo-ci January 7, 2026 05:09 — with GitHub Actions Inactive

yuki-97 approved these changes Jan 7, 2026

View reviewed changes

yuki-97 enabled auto-merge (squash) January 7, 2026 05:09

yuki-97 temporarily deployed to nemo-ci January 7, 2026 05:14 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci January 7, 2026 08:56 — with GitHub Actions Inactive

yuki-97 merged commit 77993ea into r0.5.0 Jan 7, 2026
64 of 71 checks passed

yuki-97 deleted the cherry-pick-1722-r0.5.0 branch January 7, 2026 11:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cp: `fix: use median instead of mean for logprob error for stability in nightlies (1722)` into `r0.5.0`#1731

cp: `fix: use median instead of mean for logprob error for stability in nightlies (1722)` into `r0.5.0`#1731
yuki-97 merged 1 commit intor0.5.0from
cherry-pick-1722-r0.5.0

chtruong814 commented Jan 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jan 7, 2026

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jan 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants


		# Clean up checkpoint directory after successful run to save space.
		rm -rf "$CKPT_DIR"

Conversation

chtruong814 commented Jan 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jan 7, 2026

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chtruong814 commented Jan 7, 2026 •

edited by coderabbitai Bot

Loading