…ghtlies (#1722) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
📝 Walkthrough
This pull request adds a …
Changes
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~12 minutes
The core median function addition is straightforward (a simple utility mirroring the existing mean logic). The bulk of the changes consists of highly homogeneous, repetitive pattern applications across test scripts: identical cleanup and metric replacement operations replicated across many files, which require minimal per-file review effort despite the large number of affected files.
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 1
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In
@tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.sh:
- Around line 42-44: Before running rm -rf on CKPT_DIR, add defensive
validation: ensure the CKPT_DIR variable is set/non-empty (check -n
"$CKPT_DIR"), ensure it is not "/" or empty string (explicitly guard against "/"
and maybe "."), and ensure it exists and is a directory (check -d "$CKPT_DIR");
if any check fails, log an error and skip deletion or exit non-zero; only then
run rm -rf "$CKPT_DIR". Use the CKPT_DIR variable and the rm -rf invocation in
the script to locate where to add these guards.
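The guards described in this prompt could be collected into one helper instead of being repeated in every test script. The following is a minimal sketch, not code from the PR: the function name safe_rm_ckpt_dir is hypothetical, and it is written in POSIX sh so that it does not depend on [[ ]].

```shell
# Hypothetical helper (not part of the PR): remove a checkpoint directory
# only when the path passes basic sanity checks.
safe_rm_ckpt_dir() {
    ckpt_dir_arg="${1:-}"
    case "$ckpt_dir_arg" in
        "")
            # Unset or empty: nothing sensible to delete.
            echo "Error: checkpoint directory is unset or empty; skipping cleanup" >&2
            return 1 ;;
        / | .)
            # Never delete the filesystem root or the current directory.
            echo "Error: refusing to remove '$ckpt_dir_arg'; skipping cleanup" >&2
            return 1 ;;
    esac
    if [ ! -d "$ckpt_dir_arg" ]; then
        echo "Error: '$ckpt_dir_arg' is not a directory; skipping cleanup" >&2
        return 1
    fi
    # All checks passed; "--" stops option parsing for odd directory names.
    rm -rf -- "$ckpt_dir_arg"
}
```

A test script could then call safe_rm_ckpt_dir "$CKPT_DIR" in place of the bare rm -rf "$CKPT_DIR". The function only returns a non-zero status, so the caller decides whether a failed guard should skip deletion or fail the test.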
🟠 Major comments (21)
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh-40-41 (1)
40-41: Add safety check before destructive removal.
While the variable is properly quoted, it's best practice to verify that $CKPT_DIR is set and non-empty before executing rm -rf to prevent unintended deletions if the variable is undefined or empty.
🔎 Proposed safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [ -n "$CKPT_DIR" ]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.sh-39-41 (1)
39-41: Add error checking before cleanup to prevent deleting checkpoints when tests fail.
The cleanup executes regardless of whether check_metrics.py succeeds. Without set -e or an explicit exit-status check, failed metric validations will still trigger checkpoint deletion, destroying valuable debugging artifacts.
Additionally, add a safety guard to verify CKPT_DIR is set and non-empty before executing rm -rf.
🔎 Proposed fix with error checking and safety guards
+ # Only clean up if metrics validation passed
+ if [ $? -eq 0 ] && [ -n "$CKPT_DIR" ]; then
+   # Clean up checkpoint directory after successful run to save space.
+   rm -rf "$CKPT_DIR"
+ fi
-
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
Or, for clearer intent, restructure to check the metrics first:
- uv run tests/check_metrics.py $JSON_METRICS \
-   'data["train/loss"]["1"] < 0.6' \
-   'data["train/loss"]["250"] < 0.36' \
-   'mean(data["timing/train/total_step_time"], 2) < 6'
-
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if uv run tests/check_metrics.py $JSON_METRICS \
+   'data["train/loss"]["1"] < 0.6' \
+   'data["train/loss"]["250"] < 0.36' \
+   'mean(data["timing/train/total_step_time"], 2) < 6'; then
+   # Clean up checkpoint directory after successful run to save space.
+   if [ -n "$CKPT_DIR" ]; then
+     rm -rf "$CKPT_DIR"
+   fi
+ fi
tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh-63-64 (1)
63-64: Add safety checks before destructive rm -rf operation.
The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
# Clean up checkpoint directory after successful run to save space.
-rm -rf "$CKPT_DIR"
+if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+  rm -rf "$CKPT_DIR"
+fi
tests/test_suites/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8-actckpt-long.v3.sh-40-41 (1)
40-41: Add safety checks before destructive rm -rf operation.
The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.sh-40-41 (1)
40-41: Add safety checks before destructive rm -rf operation.
The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.sh-40-41 (1)
40-41: Add safety checks before destructive rm -rf operation.
The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/sft-llama3.1-8b-1n8g-megatron.sh-40-41 (1)
40-41: Add safety checks before destructive rm -rf operation.
The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset, empty, or mistakenly points to a critical directory. Add validation before deletion.
🔎 Proposed fix with safety checks
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh-40-41 (1)
40-41: Add safety validation before destructive cleanup.
The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+ # Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
tests/test_suites/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.sh-41-42 (1)
41-42: Add safety validation before destructive cleanup.
The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+ # Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
tests/test_suites/llm/dpo-llama3.1-8b-tulu3-1n8g-fsdp2tp1.sh-43-44 (1)
43-44: Add safety validation before destructive cleanup.
The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+ # Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh-42-43 (1)
42-43: Add safety validation before destructive cleanup.
The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+ # Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
tests/test_suites/llm/grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron.sh-41-42 (1)
41-42: Add safety validation before destructive cleanup.
The rm -rf "$CKPT_DIR" command is potentially dangerous if $CKPT_DIR is unset, empty, or points to an unexpected location. While the variable is likely set by common.env, defensive validation before destructive operations is a best practice.
🔎 Suggested safety validation
+ # Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh-40-41 (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.
The rm -rf "$CKPT_DIR" command lacks validation that $CKPT_DIR is set and non-empty, which could lead to unexpected behavior if the variable is unset or empty.
🔎 Recommended fix with safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Or use parameter expansion:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/grpo-qwen2.5-32b-32n8g-fsdp2tp8-actckpt.v3.sh-40-41 (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.
The rm -rf "$CKPT_DIR" command should include validation that $CKPT_DIR is set and non-empty to prevent unexpected behavior.
🔎 Recommended fix with safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Or use parameter expansion:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-long.sh-43-44 (1)
43-44: Add safety check before rm -rf to prevent unintended deletion.
The cleanup step is a good practice to save space after successful runs. However, rm -rf "$CKPT_DIR" should validate that $CKPT_DIR is set and non-empty before execution.
🔎 Recommended fix with safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Or use parameter expansion:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-qwen3-235b-32n4g-async-1off.sh-41-42 (1)
41-42: Add safety check before rm -rf to prevent unintended deletion.
The rm -rf "$CKPT_DIR" command should validate that $CKPT_DIR is set and non-empty before execution to prevent unexpected behavior.
🔎 Recommended fix with safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Or use parameter expansion:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.sh-40-41 (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.
The rm -rf "$CKPT_DIR" command is executed without verifying that $CKPT_DIR is set and non-empty. If the variable is unset or empty, this could lead to unexpected behavior.
🔎 Recommended fix with safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ [[ -n "$CKPT_DIR" ]] && rm -rf "$CKPT_DIR"
Alternatively, use parameter expansion to ensure the variable is set:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ rm -rf "${CKPT_DIR:?CKPT_DIR is not set}"
tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh-46-47 (1)
46-47: Add safety guard before rm -rf.
Protect against accidental deletion by validating $CKPT_DIR before the rm -rf command:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh-46-47 (1)
46-47: Add safety guard before rm -rf.
Add validation to ensure safe deletion:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/grpo-qwen3-30ba3b-8n8g-megatron.sh-41-42 (1)
41-42: Add safety guard before rm -rf.
The rm -rf "$CKPT_DIR" command poses the same safety risk as noted in the other test scripts. Please add a validation check to ensure $CKPT_DIR is non-empty and exists before deletion:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh-40-41 (1)
40-41: Add safety guard before rm -rf.
Add a safety check to prevent accidental deletion if $CKPT_DIR is unset or empty:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then
+   rm -rf "$CKPT_DIR"
+ fi
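Several of the comments above propose the ${CKPT_DIR:?CKPT_DIR is not set} parameter expansion as an alternative to an explicit if. The sketch below illustrates how that expansion behaves. It uses a stand-in variable, DEMO_CKPT_DIR, which is an assumption for the demo and not a name from the PR, and it runs each case in a subshell so that a failed expansion does not end the calling script.

```shell
# DEMO_CKPT_DIR is a hypothetical stand-in for CKPT_DIR.
unset DEMO_CKPT_DIR

# Case 1: the variable is unset. The :? expansion writes the message to
# stderr and makes the (non-interactive) subshell exit with a failure status.
if ( : "${DEMO_CKPT_DIR:?DEMO_CKPT_DIR is not set}" ) 2>/dev/null; then
    unset_result="expanded"
else
    unset_result="aborted"
fi

# Case 2: the variable names a real (temporary) directory, so the
# expansion yields its value and rm -rf runs normally.
DEMO_CKPT_DIR="$(mktemp -d)"
if ( rm -rf "${DEMO_CKPT_DIR:?DEMO_CKPT_DIR is not set}" ); then
    set_result="removed"
else
    set_result="aborted"
fi

echo "unset: $unset_result, set: $set_result"
```

The colon form rejects both an unset and an empty variable, but it does not check for "/" or for an existing directory, so it covers only the first of the guards discussed in these comments.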
🧹 Nitpick comments (13)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)
38-40: Add safety check before removing checkpoint directory.
The rm -rf "$CKPT_DIR" command can be dangerous if CKPT_DIR is unset or set incorrectly. Consider adding a safety check to verify the variable is non-empty and points to a valid checkpoint directory before removal.
🔎 Proposed safety improvement
+ # Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
-
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
Additionally, the PR objectives mention "use median instead of mean for logprob error," but this file only shows checkpoint cleanup changes. Line 37 still uses mean() for timing metrics. Can you clarify if this file was intended to include median-related changes, or if the cleanup is the only intended change for this particular test suite?
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh (1)
40-41: Consider adding a safety check before removing the checkpoint directory.
While the cleanup is a good practice to save space, adding a guard to ensure $CKPT_DIR is not empty would prevent accidental deletion if the variable is unset or misconfigured.
🔎 Proposed safety enhancement
# Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
- rm -rf "$CKPT_DIR"
+   rm -rf "$CKPT_DIR"
+ fi
fi
tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron.sh (1)
42-43: LGTM! Cleanup appropriately preserves failed run artifacts.
The checkpoint cleanup after successful metrics validation is good for managing disk space in nightly runs. Since the cleanup only executes when metrics pass (inside the if block), failed runs retain their checkpoints for debugging, which is the correct behavior.
Optional: Add defensive check before cleanup
If you want to be extra defensive, you could verify CKPT_DIR is set and non-empty before removal:
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" ]] && [[ -d "$CKPT_DIR" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
However, this is likely unnecessary given the test infrastructure should ensure proper configuration.
tests/test_suites/llm/grpo-math-llama-nemotron-super-49b-v.5-4n8g-fsdp2tp8.sh.disabled (1)
35-41: Consider adding cleanup step for consistency.
Unlike other test scripts in this PR that add a checkpoint cleanup step after successful metrics validation, this file doesn't include the cleanup. While this file is disabled (.disabled extension), consider adding the cleanup step for consistency:
uv run tests/check_metrics.py $JSON_METRICS \
  'median(data["train/token_mult_prob_error"]) < 1.1' \
  'data["train/token_mult_prob_error"]["2"] < 1.1' \
  'mean(data["timing/train/policy_training"]) < 280' \
  'mean(data["ray/node.0.gpu.0.mem_gb"]) < 75'
+
+ # Clean up checkpoint directory after successful run to save space.
+ rm -rf "$CKPT_DIR"
fi
tests/test_suites/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.sh (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.
The cleanup step lacks validation of $CKPT_DIR before deletion. If the variable is unset, empty, or malformed, rm -rf could delete unintended paths.
🔎 Proposed safety guard
# Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
- rm -rf "$CKPT_DIR"
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh (1)
44-45: Add safety check before rm -rf to prevent unintended deletion.
The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.
🔎 Proposed safety guard
# Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
- rm -rf "$CKPT_DIR"
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh (1)
41-42: Add safety check before rm -rf to prevent unintended deletion.
The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.
🔎 Proposed safety guard
# Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
- rm -rf "$CKPT_DIR"
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh (1)
40-41: Add safety check before rm -rf to prevent unintended deletion.
The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.
🔎 Proposed safety guard
# Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
- rm -rf "$CKPT_DIR"
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh (1)
69-70: Add safety check before rm -rf to prevent unintended deletion.
The cleanup lacks validation of $CKPT_DIR. If unset, empty, or malformed, rm -rf could target unintended paths.
🔎 Proposed safety guard
# Clean up checkpoint directory after successful run to save space.
+if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then
-rm -rf "$CKPT_DIR"
+  rm -rf "$CKPT_DIR"
+fi
tests/test_suites/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.sh (1)
41-42: Consider adding safety check before removing checkpoint directory.
While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.
🔎 Suggested safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/grpo-qwen2.5-7b-instruct-4n8g-fsdp2tp4.v3.sh (1)
42-43: Consider adding safety check before removing checkpoint directory.
While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.
🔎 Suggested safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.sh (1)
40-41: Consider adding safety check before removing checkpoint directory.
While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.
🔎 Suggested safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh (1)
40-41: Consider adding safety check before removing checkpoint directory.
While the cleanup logic is correctly gated on successful metric checks, using rm -rf on a variable without validation could be risky if $CKPT_DIR is unset or empty.
🔎 Suggested safety check
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && -d "$CKPT_DIR" ]]; then
+   rm -rf "$CKPT_DIR"
+ fi
# Clean up checkpoint directory after successful run to save space.
rm -rf "$CKPT_DIR"
Add validation before rm -rf to prevent accidental data loss.
The rm -rf "$CKPT_DIR" command is dangerous without validation. If $CKPT_DIR is unset, empty, or accidentally points to a system directory, this could cause catastrophic data loss.
🔎 Proposed fix with defensive checks
# Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" && -d "$CKPT_DIR" ]]; then
+ rm -rf "$CKPT_DIR"
+ else
+ echo "Warning: CKPT_DIR is not set or invalid, skipping cleanup"
+ fi
This ensures:
- $CKPT_DIR is not empty (-n)
- $CKPT_DIR is not the root directory
- $CKPT_DIR exists as a directory (-d)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- # Clean up checkpoint directory after successful run to save space.
- rm -rf "$CKPT_DIR"
+ # Clean up checkpoint directory after successful run to save space.
+ if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" && -d "$CKPT_DIR" ]]; then
+   rm -rf "$CKPT_DIR"
+ else
+   echo "Warning: CKPT_DIR is not set or invalid, skipping cleanup"
+ fi
🤖 Prompt for AI Agents
In
@tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.sh
around lines 42 - 44, Before running rm -rf on CKPT_DIR, add defensive
validation: ensure the CKPT_DIR variable is set/non-empty (check -n
"$CKPT_DIR"), ensure it is not "/" or empty string (explicitly guard against "/"
and maybe "."), and ensure it exists and is a directory (check -d "$CKPT_DIR");
if any check fails, log an error and skip deletion or exit non-zero; only then
run rm -rf "$CKPT_DIR". Use the CKPT_DIR variable and the rm -rf invocation in
the script to locate where to add these guards.
…in nightlies (1722)` into `r0.5.0` (NVIDIA-NeMo#1731) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
beep boop [🤖]: Hi @terrykong 👋,
Summary by CodeRabbit
New Features
Tests