fix: use median instead of mean for logprob error for stability in nightlies#1722
fix: use median instead of mean for logprob error for stability in nightlies#1722
Conversation
…ghtlies Signed-off-by: Terry Kong <terryk@nvidia.com>
📝 WalkthroughWalkthroughIntroduces a Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🤖 Fix all issues with AI Agents
In @tests/check_metrics.py:
- Around line 100-130: The median function can call statistics.median on an
empty list (vals) which raises StatisticsError; update the median(value,
range_start=1, range_end=0) function to check if vals is empty before calling
statistics.median and return a sensible default (e.g., None or float('nan')) or
raise a controlled error; locate the vals list and the final return line in the
median function and add the empty-list guard right before the return so the
function handles ranges that yield no values gracefully.
In @tests/test_suites/llm/dapo-qwen2.5-7b.sh:
- Around line 42-43: Validate CKPT_DIR before running the cleanup: ensure the
CKPT_DIR variable is set and non-empty, confirm it points to an existing
directory (e.g., test [ -n "$CKPT_DIR" ] and [ -d "$CKPT_DIR" ]), and guard
against dangerous paths like "/" or "." (reject if "$CKPT_DIR" == "/" or empty)
before executing rm -rf "$CKPT_DIR"; update the block that currently runs rm -rf
"$CKPT_DIR" to perform these checks and only remove when they pass.
In
@tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:
- Around line 41-43: Before running rm -rf "$CKPT_DIR", validate CKPT_DIR:
ensure the variable is set and non-empty, verify it points to an existing
directory (not "/" or other critical root paths), and optionally confirm it
matches an expected safe prefix (e.g., a known tmp/checkpoints base) or prompt
for confirmation; if any check fails, log an error and abort instead of
performing rm -rf.
- Line 22: The wandb project name was changed from "nemo-rl-distillation" to
"nemo-rl" in the configuration key logger.wandb.project; verify this is
intentional. If it was accidental, revert logger.wandb.project back to
"nemo-rl-distillation"; if intentional, update the PR description and any
downstream references (CI configs, dashboards, experiment tracking docs) to
reflect the new project name so metric tracking and organization are not
disrupted.
In
@tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-seqpack.v1.sh:
- Around line 41-43: The cleanup line unconditionally runs rm -rf "$CKPT_DIR"
which is dangerous if CKPT_DIR is empty or unset; add a safety guard before
deletion that verifies CKPT_DIR is set and non-empty (and ideally not "/" or
"/"-like paths) and only then run the removal; reference CKPT_DIR and the rm -rf
call so you replace that direct removal with a conditional check that aborts or
logs an error when CKPT_DIR is missing or unsafe.
In @tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.sh:
- Around line 43-45: Add a safety check before the unconditional rm -rf by
validating the CKPT_DIR variable and its value: ensure CKPT_DIR is set and
non-empty, reject dangerous values such as "/" or "." and verify the path exists
(e.g., [[ -n "$CKPT_DIR" ]] && [[ "$CKPT_DIR" != "/" ]] && [[ "$CKPT_DIR" != "."
]] && [ -e "$CKPT_DIR" ]), then only perform rm -rf "$CKPT_DIR"; if validation
fails, log an error and skip deletion.
In @tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh:
- Around line 44-45: The cleanup uses rm -rf "$CKPT_DIR" with no safety checks;
add validation to ensure CKPT_DIR is set, non-empty, points to an existing
directory, and is not a root or obvious dangerous path (e.g., "/" or empty
string) before running rm -rf. Implement checks around the CKPT_DIR variable
(validate non-empty, [ -d "$CKPT_DIR" ], resolve with realpath and ensure it’s
not "/" and optionally ensure it matches an expected prefix or pattern), log or
exit with an error if validation fails, and only perform rm -rf "$CKPT_DIR"
after those guards pass.
In @tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.sh:
- Around line 43-45: The cleanup line unconditionally runs rm -rf "$CKPT_DIR",
which is dangerous if CKPT_DIR is unset or empty; update the script to validate
CKPT_DIR before deletion by checking it is non-empty, points to an existing
directory, and is not a root or critical path (e.g., not "/" or empty), and only
then perform the removal; adjust the block around the rm invocation that
references $CKPT_DIR to bail out with an error if the checks fail.
In @tests/test_suites/llm/dpo-llama3.1-8b-tulu3-1n8g-fsdp2tp1.sh:
- Around line 42-44: The cleanup line unconditionally runs rm -rf "$CKPT_DIR",
which is unsafe if CKPT_DIR is empty or malicious; update the script to validate
CKPT_DIR before deletion: ensure the variable is set and non-empty, optionally
resolve and verify it is under an expected base directory (e.g., using realpath)
and not "/" or "/"-like paths, and only then perform rm -rf "$CKPT_DIR";
reference the CKPT_DIR variable and the cleanup block in the test_suites/llm
script when adding these guards.
In @tests/test_suites/llm/dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.sh:
- Around line 41-43: The cleanup line using rm -rf "$CKPT_DIR" is unsafe if
CKPT_DIR is empty or points to root; before removal, verify CKPT_DIR is
non-empty, not equal to "/" (or other critical paths), and refers to an existing
directory, then perform the removal; update the script around the rm -rf
invocation to perform these guards (check CKPT_DIR variable, check it’s not "/"
and that the path exists) and only then run the delete command.
In
@tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh:
- Around line 41-43: The cleanup currently runs rm -rf "$CKPT_DIR" without
validating CKPT_DIR; change it to first verify CKPT_DIR is set/non-empty and
points to a safe directory (e.g., check [ -n "$CKPT_DIR" ] and that it is not
"/" and optionally that it exists and is a directory) before performing the
removal; update the block around the rm -rf invocation (the CKPT_DIR variable
usage) to only execute removal when those validations pass and log or fail
otherwise.
In @tests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh:
- Around line 69-70: The cleanup command unconditionally removes "$CKPT_DIR"
even if the second metrics check failed or if MAX_STEPS was not reached; update
the script so cleanup only runs when all relevant checks pass: either move the
conversion, eval, second metrics check and rm -rf "$CKPT_DIR" inside the
existing if [[ $(jq ...) -ge $MAX_STEPS ]] block, or add an explicit guard that
verifies the first metrics check succeeded and the second check returned success
before executing rm -rf "$CKPT_DIR"; use the existing jq MAX_STEPS check and the
variable/results from the second metrics check to gate the CKPT_DIR removal.
In @tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh:
- Around line 63-64: The cleanup uses rm -rf "$CKPT_DIR" without validating
CKPT_DIR; add a safety check before that destructive command to ensure CKPT_DIR
is set and non-empty and not equal to "/" (and optionally confirm it exists and
is under the expected parent directory) and only then run the rm -rf
"$CKPT_DIR"; locate the rm -rf invocation and wrap it with the validation logic
using the CKPT_DIR variable to prevent accidental deletion.
In @tests/test_suites/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.sh:
- Around line 41-42: The cleanup line blindly runs rm -rf "$CKPT_DIR"; add a
safety guard that ensures CKPT_DIR is non-empty and points to an existing
directory (e.g., test that CKPT_DIR is not empty and [ -d "$CKPT_DIR" ]), and
also reject dangerous values like "/" or empty string before calling rm -rf;
update the script around the CKPT_DIR cleanup to perform these checks and only
run rm -rf "$CKPT_DIR" when the conditions pass.
In @tests/test_suites/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.sh:
- Around line 39-41: Add a safety guard before the unconditional rm -rf by
validating CKPT_DIR: ensure the variable is set and non-empty, verify it points
to an existing directory (not "/" or empty string), and optionally check it
matches an expected path pattern or prefix before performing rm -rf "$CKPT_DIR";
update the cleanup block around the rm -rf command (reference CKPT_DIR) to
perform these checks and only delete when all validations pass.
In @tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh:
- Around line 40-41: Add a safety validation before the cleanup that ensures the
CKPT_DIR variable is set and non-empty and points to an existing directory
before calling rm -rf; locate the cleanup block where CKPT_DIR is used (the rm
-rf "$CKPT_DIR" line) and add a guard that checks CKPT_DIR is not empty and that
[ -d "$CKPT_DIR" ] (and optionally not equal to "/" or "."), only then perform
the removal, otherwise log/exit to avoid accidental deletion.
In
@tests/test_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled:
- Around line 37-40: Add the missing checkpoint cleanup after the metrics
validation block by removing the checkpoint directory variable (CKPT_DIR) when
the validation passes; locate the section containing the validation expression
'median(data["train/token_mult_prob_error"]) < 1.1' /
"data['train/token_mult_prob_error']['$MAX_STEPS'] < 1.1" and append the same
cleanup command used in the other test suites (rm -rf "$CKPT_DIR") so the script
follows the PR pattern and cleans up checkpoints on success.
In
@tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh:
- Around line 42-43: The cleanup currently runs rm -rf "$CKPT_DIR"
unconditionally which is dangerous if CKPT_DIR is unset/empty or points to a
root path; before removing, validate CKPT_DIR: ensure it is set and non-empty,
ensure it is an existing directory (not a file), guard against dangerous values
like "/" or an unexpected top-level path (optionally check it matches an
expected prefix), and only then run the removal; update the cleanup around the
rm -rf invocation (the CKPT_DIR variable and the rm -rf "$CKPT_DIR" line) to
perform these checks and fail-safe if validation does not pass.
In @tests/test_suites/llm/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e.sh:
- Around line 40-41: The deletion line using rm -rf "$CKPT_DIR" must be guarded:
ensure the CKPT_DIR variable is non-empty and points to an existing directory
before removing; update the script around that deletion to check [[ -n
"$CKPT_DIR" ]] and [[ -d "$CKPT_DIR" ]], quote the variable when used, and
invoke rm with the -- separator (e.g., rm -rf -- "$CKPT_DIR") only after the
checks pass; otherwise, log or exit with a clear error instead of running rm.
In @tests/test_suites/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.sh:
- Around line 40-41: Before running the dangerous rm command, verify CKPT_DIR is
non-empty, points to an existing directory and is not root ("/") or "/"-like;
replace the bare rm -rf "$CKPT_DIR" with a guarded sequence such as checking [[
-n "$CKPT_DIR" ]] && [ -d "$CKPT_DIR" ] && [ "$CKPT_DIR" != "/" ] (and
optionally ensure it is not "." or "$PWD") then call rm -rf -- "$CKPT_DIR";
reference the CKPT_DIR variable and the rm -rf invocation when making the
change.
In @tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.sh:
- Around line 41-42: The cleanup step unconditionally runs rm -rf "$CKPT_DIR"
which is dangerous if CKPT_DIR is unset/empty or points to "/" — add a safety
guard before removal: ensure CKPT_DIR is non-empty and not "/" (and ideally not
"."), then invoke rm with a safe form (use -- to avoid treating leading dashes);
reference CKPT_DIR and the rm -rf call and only proceed to delete when those
checks pass.
In @tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.sh:
- Around line 42-43: Add a safety check before the destructive rm -rf operation:
verify CKPT_DIR is non-empty and not '/' (and optionally not '.'), and only then
run rm -rf -- "$CKPT_DIR"; otherwise print an error and exit non-zero.
Specifically, replace the raw rm -rf "$CKPT_DIR" with a guard like checking [[
-n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]] (and "$CKPT_DIR" != ".") before
executing rm -rf -- "$CKPT_DIR" to prevent accidental deletion when CKPT_DIR is
unset or invalid.
In @tests/test_suites/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.sh:
- Around line 36-41: The cleanup command rm -rf "$CKPT_DIR" runs
unconditionally; make it conditional on the metrics-check succeeding by either
adding set -e near the top (after sourcing common.env) so the script will exit
on any failure, or check the exit status of the uv run invocation for
tests/check_metrics.py (e.g., run the command and only execute rm -rf
"$CKPT_DIR" if its exit code is 0) so checkpoints are only removed after a
successful run.
In @tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron.sh:
- Around line 36-43: The cleanup runs unconditionally and may delete checkpoints
even when tests/check_metrics.py fails; guard the rm -rf "$CKPT_DIR" so it only
runs on success by checking the exit status of the metrics command (or enable
shell error exit with set -e at the top) and only perform the rm when
tests/check_metrics.py returns zero; reference the invocation of
tests/check_metrics.py and the rm -rf "$CKPT_DIR" in your change and ensure
failed runs preserve $CKPT_DIR for debugging.
In @tests/test_suites/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.sh:
- Around line 41-42: Before running the destructive rm -rf, validate CKPT_DIR:
ensure the variable is set and non-empty, confirm it points to an existing
directory and is not a dangerous path like "/" or empty, and bail out with an
error if the checks fail; then proceed to remove the directory. Specifically
update the rm -rf "$CKPT_DIR" site to perform these checks on CKPT_DIR and only
call rm -rf when all validations pass.
In @tests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh:
- Around line 46-47: The script currently removes the checkpoint directory
unconditionally with rm -rf "$CKPT_DIR"; change this so cleanup only runs after
metrics validation succeeds by making the rm conditional on the exit status of
the metrics check (e.g., run check_metrics.py and only execute rm -rf
"$CKPT_DIR" if that command exits zero or chain with &&), or explicitly test the
return code ($?) and skip deletion on failure so artifacts remain for debugging;
reference the check_metrics.py invocation and the CKPT_DIR / rm -rf "$CKPT_DIR"
lines to locate where to apply the change.
In @tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh:
- Around line 45-47: The rm -rf "$CKPT_DIR" is unsafe if CKPT_DIR is empty or
misset; before deletion validate that CKPT_DIR is set and non-empty, that it
points to an existing directory and is not a dangerous path (e.g., "/" or "." or
empty string), and only then perform the removal; update the cleanup block to
perform these checks around the CKPT_DIR variable and skip/abort deletion with
an error if validation fails.
In @tests/test_suites/llm/performance/grpo-deepseek-v3-64n4g-async-1off.sh:
- Around line 42-47: The cleanup currently runs unconditionally after the
metrics check; change it so rm -rf "$CKPT_DIR" only runs when the metrics
validation succeeds by capturing the exit status of the uv run
tests/check_metrics.py invocation (the command starting with "uv run
tests/check_metrics.py $JSON_METRICS ...") and conditionally executing the
removal on a zero exit code; if non-zero, print or log a failure message and
skip deleting "$CKPT_DIR" so artifacts are preserved for debugging.
In @tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh:
- Around line 46-47: Add a safety check before running rm -rf "$CKPT_DIR":
verify the CKPT_DIR variable is set and non-empty, ensure it is not "/" or the
root directory, and optionally confirm it matches an expected base path or
pattern (e.g., contains a known workspace prefix) before performing deletion; if
the check fails, log an error and skip removal. Use the CKPT_DIR variable and
the rm -rf invocation locations in the script to implement these guards.
In @tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh:
- Around line 46-47: The cleanup step unconditionally runs rm -rf "$CKPT_DIR"
which can be dangerous if CKPT_DIR is unset or malformed; before removing,
validate that CKPT_DIR is non-empty, not "/", and points to an existing
directory (and optionally matches an expected path prefix), then perform the
removal using a safe rm invocation; update the script around the CKPT_DIR
removal to check these conditions (referencing the CKPT_DIR variable and the rm
-rf invocation) and only call rm when the validations pass.
In @tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n4g.sh:
- Around line 40-41: The script currently runs rm -rf "$CKPT_DIR" which is
dangerous if CKPT_DIR is empty, unset, or points to a critical path; before
deletion, validate CKPT_DIR: ensure the variable is set and non-empty, resolve
it to an absolute path (realpath or equivalent), verify it exists and is not "/"
or the filesystem root, and optionally assert it is under an expected safe base
directory or matches a known prefix; only proceed to run rm -rf "$CKPT_DIR"
after these checks pass (refer to CKPT_DIR and the rm -rf invocation to locate
the deletion line).
In @tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh:
- Around line 40-41: Validate the CKPT_DIR variable before running rm -rf:
ensure CKPT_DIR is set and non-empty and not a dangerous path (e.g., "/" or
empty) and optionally verify it points to an expected directory (exists and is a
directory) before executing the cleanup; update the cleanup block around the rm
-rf "$CKPT_DIR" invocation in the script to perform these checks and only remove
the directory when they pass.
In @tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh:
- Around line 41-42: The script currently always removes the checkpoint
directory (CKPT_DIR) after running check_metrics.py, which can delete artifacts
needed for debugging if the metrics check fails; update the flow so the cleanup
only runs when the metrics validation succeeds by making the rm -rf "$CKPT_DIR"
conditional on the exit status of check_metrics.py (for example, run the metrics
command and then only execute the CKPT_DIR removal when it succeeds using an
exit-code check or shell && chaining), ensuring that CKPT_DIR is preserved on
failure for investigation.
In @tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh:
- Around line 41-42: Add a safety guard before the unconditional rm -rf call to
ensure CKPT_DIR is non-empty and points to an existing directory; specifically,
check that the CKPT_DIR variable is set (non-empty) and that the path is a
directory, then perform the removal (use the -- delimiter when invoking rm to
handle paths beginning with a dash) to avoid accidental deletion when CKPT_DIR
is unset or empty.
In @tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh:
- Around line 40-42: The cleanup step uses rm -rf on CKPT_DIR without
validation; ensure CKPT_DIR is set, non-empty, not "/" and matches the expected
safe path/prefix before removal, and use the safe rm invocation that protects
against args starting with a dash; update the script around the rm -rf
"$CKPT_DIR" line to perform these checks and only proceed to remove when they
pass, otherwise log an error and skip deletion.
In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh:
- Around line 40-41: The cleanup currently removes "$CKPT_DIR" unconditionally;
change it to run only when the metrics validation succeeds by chaining or
checking the exit status of the metrics check (e.g., run check_metrics.py and
only execute rm -rf "$CKPT_DIR" if that command exits zero). Locate the
invocation of check_metrics.py and replace the unconditional rm -rf "$CKPT_DIR"
with a conditional/&& chain or an if-check that tests the exit code of
check_metrics.py before deleting the checkpoint directory.
In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh:
- Around line 40-41: The cleanup uses an unguarded rm -rf on $CKPT_DIR; add a
safety check before that command to ensure CKPT_DIR is non-empty, not "/", and
points to an existing directory (e.g., test -n "$CKPT_DIR" && [ "$CKPT_DIR" !=
"/" ] && [ -d "$CKPT_DIR" ]), and only then run the removal (use -- with rm).
Update the invocation around the rm -rf "$CKPT_DIR" line to perform these
validations and bail out or log an error if the checks fail.
In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh:
- Around line 39-41: The cleanup uses rm -rf "$CKPT_DIR" which can delete
unintended targets if CKPT_DIR is empty or unset; before removal, validate that
CKPT_DIR is defined and non-empty and optionally ensure it points to an expected
directory pattern (e.g., contains a specific prefix or is under a known base
path) and only then perform the removal; update the script surrounding the rm
-rf invocation (the CKPT_DIR variable usage) to check for non-empty value and
safe path conditions and abort with a clear error if the check fails.
In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh:
- Around line 40-41: The cleanup uses rm -rf "$CKPT_DIR" without safety checks;
ensure CKPT_DIR is non-empty and points to an existing directory before removal
by adding a guard that verifies the variable is set/non-empty and that the path
is a directory (e.g., check both non-empty and -d) and only then call rm -rf on
CKPT_DIR; refer to the CKPT_DIR variable and the rm -rf "$CKPT_DIR" statement
when making the change.
In @tests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh:
- Around line 40-41: Before running the destructive rm -rf on CKPT_DIR, validate
that the CKPT_DIR variable is set, non-empty, not equal to "/" (root), and
points to an existing directory; only then perform the removal and use the safe
rm invocation (with --) to avoid treating paths beginning with - as options.
Update the cleanup block around rm -rf "$CKPT_DIR" to perform these checks and
skip or log an error if the validation fails.
In @tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh:
- Around line 40-41: The checkpoint cleanup (rm -rf "$CKPT_DIR") runs
unconditionally and can delete artifacts when check_metrics.py fails; change the
flow so the cleanup only runs on successful metrics validation by checking the
exit status of the metrics step (e.g., run or inspect check_metrics.py exit code
or use conditional chaining) and move or guard the rm -rf "$CKPT_DIR" behind
that success check (reference the check_metrics.py invocation and CKPT_DIR
cleanup line).
In @tests/test_suites/llm/performance/grpo-qwen3-32b-8n4g-async-1off.sh:
- Around line 36-41: The script currently always removes the checkpoint
directory (rm -rf "$CKPT_DIR") even if the metrics validation command (uv run
tests/check_metrics.py $JSON_METRICS ...) fails; change the flow so that the
cleanup runs only on successful validation by checking the exit status of the
metrics command (or enable strict error handling like set -e) and only executing
rm -rf "$CKPT_DIR" when tests/check_metrics.py exits successfully, otherwise
preserve the directory and propagate a non-zero exit code for post-mortem
debugging.
In @tests/test_suites/llm/performance/grpo-qwen3-32b-8n8g-async-1off.sh:
- Around line 36-41: The script currently always removes the checkpoint
directory after running the metrics check; change it so checkpoint cleanup runs
only if the metrics validation succeeds by testing the exit status of the check
command (the call to tests/check_metrics.py or the "uv run
tests/check_metrics.py $JSON_METRICS ..." invocation) and only executing the rm
-rf "$CKPT_DIR" when that command exits successfully (e.g., use an if on its
exit code or chain with &&) so failed validations preserve CKPT_DIR for
debugging.
In @tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh:
- Around line 42-43: The cleanup step uses rm -rf "$CKPT_DIR" without
validation; add a safety check before that line to ensure CKPT_DIR is set,
non-empty, not "/" or empty string, points to an existing directory, and
(optionally) resides under an expected base/path prefix; only proceed to remove
when these checks pass and otherwise log an error and exit. Reference the
CKPT_DIR variable and the rm -rf line in the script and implement the validation
logic immediately before calling rm to prevent accidental deletion of critical
paths.
In @tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh:
- Around line 42-44: Add a safety guard before the destructive rm -rf call:
validate CKPT_DIR is set and non-empty, is not "/" (or other critical paths),
and is a directory (e.g., check [ -n "$CKPT_DIR" ] && [ "$CKPT_DIR" != "/" ] &&
[ -d "$CKPT_DIR" ]) and only then run rm -rf -- "$CKPT_DIR"; this prevents
accidental deletion when CKPT_DIR is empty, unset, or misconfigured and uses --
to handle paths beginning with -.
In @tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp2.sh:
- Around line 44-45: The cleanup call rm -rf "$CKPT_DIR" is unsafe as-is; add
validation before deletion: check CKPT_DIR is non-empty, is an existing
directory, and not "/" (and ideally not "$HOME" or root paths), optionally
assert it matches an expected prefix (e.g., /tmp or a project-specific base); if
any check fails, log an error and exit without running rm. Use a safe removal
invocation rm -rf -- "$CKPT_DIR" after validations and include informative
logging (e.g., echo or printf) so failures are visible.
In @tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh:
- Around line 38-40: The cleanup call using rm -rf "$CKPT_DIR" is unsafe if
CKPT_DIR is empty or points to a critical path; add a defensive validation
before deletion: verify CKPT_DIR is set and non-empty, ensure it is not "/" (or
other forbidden roots), and optionally confirm it exists and matches the
expected checkpoint directory pattern; if the check fails, log an error and
abort instead of calling rm -rf. Use the CKPT_DIR variable and the cleanup block
around rm -rf "$CKPT_DIR" to implement these checks.
In @tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh:
- Around line 38-40: Before running the dangerous removal, ensure CKPT_DIR is
set, non-empty, and not the root directory; replace the bare rm -rf "$CKPT_DIR"
with a guarded check that verifies CKPT_DIR is defined and not "/" (and
optionally matches an expected base path or prefix), log or error out if the
check fails, and only then perform the recursive delete for CKPT_DIR.
In @tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh:
- Around line 43-45: Add a guard before the existing rm -rf "$CKPT_DIR" to
ensure CKPT_DIR is set, non-empty, and not a dangerous path (e.g., "/" or "")
before deleting; update the script around the rm invocation (the CKPT_DIR
variable and cleanup block) to first check that CKPT_DIR is defined and not
equal to "/" (and optionally that it exists and is a directory) and only then
run the removal, otherwise log/echo a warning and skip deletion.
In @tests/test_suites/llm/sft-qwen2.5-math7b-2n8g-megatron.sh:
- Around line 44-45: The cleanup line using rm -rf "$CKPT_DIR" is unsafe; add
validation before removal: ensure the CKPT_DIR variable is set and non-empty,
verify it points to an existing directory (and not "/" or other critical paths),
optionally resolve it with realpath to prevent symlink tricks, and only then
perform rm -rf; if validation fails, print an error and exit non‑zero. Reference
the CKPT_DIR variable and the rm -rf command when implementing the checks.
In
@tests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh:
- Around line 38-40: The cleanup step uses rm -rf "$CKPT_DIR" without
validation; add a safety guard that verifies CKPT_DIR is set and non-empty (and
optionally not "/" or other critical paths) before running the removal. Update
the script around the rm -rf invocation (the CKPT_DIR variable and the cleanup
block) to check e.g. that CKPT_DIR != "" and not equal to "/" (or perform a
pattern check) and only then execute rm -rf "$CKPT_DIR"; if the check fails,
emit a clear error/warning and skip deletion.
🟠 Major comments (34)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh-38-40 (1)
38-40: Add safety check beforerm -rfto prevent unintended deletion.The
rm -rf "$CKPT_DIR"command could be dangerous if$CKPT_DIRis unset, empty, or misconfigured. While the variable is sourced fromcommon.env, defensive programming suggests adding a validation check before deletion.🔎 Proposed safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/sft-qwen2.5-32b-4n8g-fsdp2tp8sp-actckpt.v3.sh-43-45 (1)
43-45: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command is potentially dangerous ifCKPT_DIRis unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.🔎 Proposed fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" ]]; then + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/dpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v2.sh-41-43 (1)
41-43: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command is potentially dangerous ifCKPT_DIRis unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.🔎 Proposed fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" ]]; then + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh-38-40 (1)
38-40: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command is potentially dangerous ifCKPT_DIRis unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.🔎 Proposed fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" ]]; then + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-seqpack.v1.sh-41-43 (1)
41-43: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command is potentially dangerous ifCKPT_DIRis unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.🔎 Proposed fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" ]]; then + rm -rf "$CKPT_DIR" + fitests/test_suites/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-dtensor2tp1.v1.sh-38-40 (1)
38-40: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command is potentially dangerous ifCKPT_DIRis unset or empty. Consider adding a guard to verify the variable is set and non-empty before deletion.🔎 Proposed fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" ]]; then + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/dapo-qwen2.5-7b.sh-42-43 (1)
42-43: Add validation before removing checkpoint directory.While the cleanup is appropriately placed (only on success), the
rm -rfcommand should validate that$CKPT_DIRis set and non-empty to prevent accidental deletion if the variable is malformed.🔎 Proposed safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/grpo-gspo-deepscaler-1.5b-8K.sh-40-41 (1)
40-41: Add validation before removing checkpoint directory.While the cleanup is appropriately placed (only on success), the
rm -rfcommand should validate that$CKPT_DIRis set and non-empty to prevent accidental deletion if the variable is malformed.🔎 Proposed safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-fp8-async-1off.sh-46-47 (1)
46-47: Add validation before removing checkpoint directory.While the cleanup is appropriately placed (only on success), the
rm -rfcommand should validate that$CKPT_DIRis set and non-empty to prevent accidental deletion if the variable is malformed.🔎 Proposed safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/performance/grpo-llama3.1-8b-instruct-2n8g.sh-40-41 (1)
40-41: Add validation before removing checkpoint directory.While the cleanup is appropriately placed (only on success), the
rm -rfcommand should validate that$CKPT_DIRis set and non-empty to prevent accidental deletion if the variable is malformed.🔎 Proposed safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh-41-43 (1)
41-43: Add validation before removing checkpoint directory.While the cleanup is appropriately placed (only on success), the
rm -rfcommand should validate that$CKPT_DIRis set and non-empty to prevent accidental deletion if the variable is malformed.🔎 Proposed safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/performance/grpo-qwen3-30ba3b-24n8g-async-8off.sh-40-41 (1)
40-41: Conditionally execute cleanup only after successful metrics validation.The checkpoint cleanup runs unconditionally after
check_metrics.py. If the metrics check fails, the checkpoint is still deleted, removing debugging artifacts.🔎 Recommended fix: chain cleanup to check success
- uv run tests/check_metrics.py $JSON_METRICS \ - 'median(data["train/token_mult_prob_error"]) < 1.1' \ - 'data["train/token_mult_prob_error"]["10"] < 1.1' - - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if uv run tests/check_metrics.py $JSON_METRICS \ + 'median(data["train/token_mult_prob_error"]) < 1.1' \ + 'data["train/token_mult_prob_error"]["10"] < 1.1'; then + # Clean up checkpoint directory after successful run to save space. + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/performance/grpo-deepseek-v3-32n4g.sh-46-47 (1)
46-47: Conditionally execute cleanup only after successful metrics validation.The checkpoint cleanup runs regardless of whether the metrics check succeeds. If
check_metrics.pyfails, the checkpoint is still deleted, removing artifacts needed for debugging.🔎 Recommended fix: chain cleanup to check success
- uv run tests/check_metrics.py $JSON_METRICS \ - 'median(data["train/token_mult_prob_error"]) < 1.1' \ - 'data["train/token_mult_prob_error"]["10"] < 1.1' - - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if uv run tests/check_metrics.py $JSON_METRICS \ + 'median(data["train/token_mult_prob_error"]) < 1.1' \ + 'data["train/token_mult_prob_error"]["10"] < 1.1'; then + # Clean up checkpoint directory after successful run to save space. + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh-41-42 (1)
41-42: Conditionally execute cleanup only after successful metrics validation.The checkpoint cleanup runs unconditionally after
check_metrics.py. If the metrics check fails but bash continues execution, the checkpoint is deleted, losing debugging artifacts.🔎 Recommended fix: chain cleanup to check success
- uv run tests/check_metrics.py $JSON_METRICS \ - 'median(data["train/token_mult_prob_error"]) < 1.1' \ - 'data["train/token_mult_prob_error"]["10"] < 1.1' - - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if uv run tests/check_metrics.py $JSON_METRICS \ + 'median(data["train/token_mult_prob_error"]) < 1.1' \ + 'data["train/token_mult_prob_error"]["10"] < 1.1'; then + # Clean up checkpoint directory after successful run to save space. + rm -rf "$CKPT_DIR" + fiCommittable suggestion skipped: line range outside the PR's diff.
tests/test_suites/llm/performance/grpo-qwen3-32b-4n4g.sh-40-41 (1)
40-41: Conditionally execute cleanup only after successful metrics validation.The checkpoint cleanup currently runs unconditionally after
check_metrics.py. If the metrics check fails but the script continues (bash default behavior withoutset -e), the checkpoint is still deleted, removing valuable debugging artifacts.🔎 Recommended fix: chain cleanup to check success
- uv run tests/check_metrics.py $JSON_METRICS \ - 'median(data["train/token_mult_prob_error"]) < 1.1' \ - 'data["train/token_mult_prob_error"]["10"] < 1.1' - - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if uv run tests/check_metrics.py $JSON_METRICS \ + 'median(data["train/token_mult_prob_error"]) < 1.1' \ + 'data["train/token_mult_prob_error"]["10"] < 1.1'; then + # Clean up checkpoint directory after successful run to save space. + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/grpo-deepscaler-1.5b-24K.sh-69-70 (1)
69-70: Conditionally execute cleanup and verify placement.Two concerns with the cleanup placement:
Unconditional execution: The cleanup runs after the second metrics check (line 66-67) without verifying that check succeeded. If the check fails, the checkpoint is still deleted.
Outside MAX_STEPS conditional: The cleanup at line 69-70 is outside the
if [[ $(jq ...) -ge $MAX_STEPS ]]block that guards the first metrics check. IfMAX_STEPSis not reached, the first check is skipped, but conversion/eval/second check/cleanup all run. This may be intentional, but seems inconsistent with the pattern in other test suites where cleanup only occurs whenMAX_STEPSis reached.🔎 Recommended fix: make cleanup conditional on all checks passing
uv run tests/check_metrics.py ${RUN_LOG}-24k-metric.json \ - 'data["score"] >= 0.2396' - -# Clean up checkpoint directory after successful run to save space. -rm -rf "$CKPT_DIR" + 'data["score"] >= 0.2396' && \ + rm -rf "$CKPT_DIR" # Clean up checkpoint directory after successful run to save space.If the intent is to also guard against running when
MAX_STEPSis not reached, consider moving the conversion/eval/cleanup inside theMAX_STEPSconditional or adding an explicit check.tests/test_suites/llm/grpo-deepscaler-1.5b-8K.sh-63-64 (1)
63-64: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command is dangerous if$CKPT_DIRis unset or empty. Shell best practices require validating the variable before destructive operations.🔎 Suggested fix with safety check
# Clean up checkpoint directory after successful run to save space. -rm -rf "$CKPT_DIR" +if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" +fitests/test_suites/llm/grpo-qwen2.5-math-1.5b-instruct-1n8g-fsdp2tp1.v3.sh-41-42 (1)
41-42: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command is dangerous if$CKPT_DIRis unset or empty. Shell best practices require validating the variable before destructive operations.🔎 Suggested fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi fitests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n4g.sh-40-41 (1)
40-41: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command is dangerous if$CKPT_DIRis unset or empty. Shell best practices require validating the variable before destructive operations.🔎 Suggested fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi fitests/test_suites/llm/performance/grpo-qwen3-30ba3b-8n4g-async-1off.sh-40-41 (1)
40-41: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command is dangerous if$CKPT_DIRis unset or empty. Shell best practices require validating the variable before destructive operations.🔎 Suggested fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi fitests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.sh-42-43 (1)
42-43: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command is dangerous if$CKPT_DIRis unset or empty. Shell best practices require validating the variable before destructive operations.🔎 Suggested fix with safety check
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi fitests/test_suites/llm/grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long.sh-39-41 (1)
39-41: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command can be dangerous ifCKPT_DIRis empty, unset, or misconfigured. Add a validation check to ensure the variable is set and non-empty before deletion.🔎 Recommended safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/performance/grpo-deepseek-v3-32n8g.sh-45-47 (1)
45-47: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command can be dangerous ifCKPT_DIRis empty, unset, or misconfigured. Add a validation check to ensure the variable is set and non-empty before deletion.🔎 Recommended safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g-async-1off.sh-39-41 (1)
39-41: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command can be dangerous ifCKPT_DIRis empty, unset, or misconfigured. Add a validation check to ensure the variable is set and non-empty before deletion.🔎 Recommended safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh-42-44 (1)
42-44: Add safety check beforerm -rfto prevent accidental deletion.The
rm -rf "$CKPT_DIR"command can be dangerous ifCKPT_DIRis empty, unset, or misconfigured. Add a validation check to ensure the variable is set and non-empty before deletion.🔎 Recommended safety check
+ # Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then + rm -rf "$CKPT_DIR" + fi - - # Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR"tests/test_suites/llm/grpo-llama3.2-1b-instruct-1n8g-fsdp2tp1.v3.sh-41-42 (1)
41-42: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command could be dangerous ifCKPT_DIRis unset or empty.🔎 Proposed safety check
# Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then - rm -rf "$CKPT_DIR" + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/performance/grpo-qwen3-32b-4n8g.sh-40-41 (1)
40-41: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command could be dangerous ifCKPT_DIRis unset or empty.🔎 Proposed safety check
# Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then - rm -rf "$CKPT_DIR" + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3.sh-40-41 (1)
40-41: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command could be dangerous ifCKPT_DIRis unset or empty.🔎 Proposed safety check
# Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then - rm -rf "$CKPT_DIR" + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.sh-42-43 (1)
42-43: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command could be dangerous ifCKPT_DIRis unset or empty. Consider adding a validation check to prevent accidental deletion.🔎 Proposed safety check
# Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then - rm -rf "$CKPT_DIR" + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/performance/grpo-deepseek-v3-64n8g-async-1off.sh-46-47 (1)
46-47: Add safety check before removing checkpoint directory.The
rm -rf "$CKPT_DIR"command could be dangerous ifCKPT_DIRis unset or empty. Consider adding a validation check to prevent accidental deletion.🔎 Proposed safety check
# Clean up checkpoint directory after successful run to save space. + if [[ -n "$CKPT_DIR" && "$CKPT_DIR" != "/" ]]; then - rm -rf "$CKPT_DIR" + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e.sh-40-41 (1)
40-41: Add safety guard beforerm -rfto prevent unintended deletion.If
$CKPT_DIRis unset or empty,rm -rfcould behave unexpectedly. Add a guard to verify the variable is non-empty and the directory exists.🔎 Proposed fix
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/grpo-gemma3-1b-it-1n8g-fsdp2tp1.sh-41-42 (1)
41-42: Add safety guard beforerm -rfto prevent unintended deletion.If
$CKPT_DIRis unset or empty,rm -rfcould behave unexpectedly. Add a guard to verify the variable is non-empty and the directory exists.🔎 Proposed fix
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/performance/grpo-qwen3-30ba3b-4n8g.sh-40-41 (1)
40-41: Add safety guard beforerm -rfto prevent unintended deletion.If
$CKPT_DIRis unset or empty,rm -rfcould behave unexpectedly. Add a guard to verify the variable is non-empty and the directory exists.🔎 Proposed fix
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then + rm -rf "$CKPT_DIR" + fitests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh-41-42 (1)
41-42: Add safety guard beforerm -rfto prevent unintended deletion.If
$CKPT_DIRis unset or empty,rm -rfcould behave unexpectedly. Add a guard to verify the variable is non-empty and the directory exists.🔎 Proposed fix
# Clean up checkpoint directory after successful run to save space. - rm -rf "$CKPT_DIR" + if [ -n "$CKPT_DIR" ] && [ -d "$CKPT_DIR" ]; then + rm -rf "$CKPT_DIR" + fiCommittable suggestion skipped: line range outside the PR's diff.
🟡 Minor comments (3)
tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh-22-22 (1)
22-22: Verify that the wandb project name change is intentional.The wandb project name has changed from
nemo-rl-distillationtonemo-rl. This change is not mentioned in the PR objectives and could affect metric tracking and organization in Weights & Biases. Please confirm this is an intentional change and not an accidental modification.tests/test_suites/llm/grpo-helpsteer3-llama-3.3-nemotron-super-49b-v1.5-8n8g-fsdp2tp8cp4.sh.disabled-37-40 (1)
37-40: Missing cleanup step - inconsistent with PR pattern.Unlike other test suites in this PR that add checkpoint cleanup (
rm -rf "$CKPT_DIR") after successful metrics validation, this file is missing that addition. For consistency, consider adding the same cleanup step after line 38, even though the file is currently disabled.🔎 Suggested addition for consistency
uv run tests/check_metrics.py $JSON_METRICS \ 'median(data["train/token_mult_prob_error"]) < 1.1' \ "data['train/token_mult_prob_error']['$MAX_STEPS'] < 1.1" + + # Clean up checkpoint directory after successful run to save space. + rm -rf "$CKPT_DIR" fitests/test_suites/llm/performance/dapo-deepseek-v3-64n8g.sh-47-49 (1)
47-49: Remove unnecessary checkpoint cleanup or configure it to use the actual checkpoint directory.The
rm -rf "$CKPT_DIR"command removes an empty directory that is never populated. The config file specifiescheckpoint_dir: results/grpo_dapomath17k_dsv3_megatron(line 30 of dapo-deepseek-v3-64n8g.yaml), but the test script does not pass this to the training entrypoint, so checkpoints are stored in the config path instead. TheCKPT_DIRvariable ($EXP_DIR/ckpts) is created bycommon.envbut remains unused.Either remove the cleanup step or configure the training call to use
CKPT_DIRby adding a checkpoint override to the command line arguments.
…ghtlies (#1722) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
…ghtlies (NVIDIA-NeMo#1722) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
…ghtlies (NVIDIA-NeMo#1722) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
…ghtlies (NVIDIA-NeMo#1722) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
…ghtlies (#1722) Signed-off-by: Terry Kong <terryk@nvidia.com>
…ghtlies (#1722) Signed-off-by: Terry Kong <terryk@nvidia.com>
…ghtlies (#1722) Signed-off-by: Terry Kong <terryk@nvidia.com>
What does this PR do ?
Also contains a change to just clean up ckpt dirs from the nightlies
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.