feat: Add bisecting tooling for nightly test regressions (#1223)
Conversation

Force-pushed 89a0b31 to 6c79364.

@terrykong: "@CodeRabbit title"
Tip: For best results, initiate chat on the files or code changes.

@terrykong Based on the PR content, I suggest the following title:

feat: Add bisecting tooling for nightly test regressions

Or alternatively:

feat: Enable git bisect automation for nightly test failures

Both titles clearly convey the PR's intent. The first option is more concise, while the second is more specific about the git bisect automation aspect. Let me know if you'd like me to adjust the wording!
Force-pushed 418b671 to 94781a7.
📝 Walkthrough

The PR introduces comprehensive git bisect testing infrastructure: documentation on bisecting failing tests, a new Bash script for orchestrating bisect workflows with submodule handling, a metrics table width control parameter, and enhanced launcher capabilities with job tracking and post-run workflow analysis.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (2)
tools/launch (1)
265-279: Consider replacing deprecated `fgrep` with `grep -F`.

`fgrep` is deprecated in favor of `grep -F` per POSIX recommendations. While it still works, updating would future-proof the script.

🔎 Suggested diff

```diff
-    if ! fgrep -A10000 'Metric Checks' $logs &>/dev/null; then
+    if ! grep -F -A10000 'Metric Checks' $logs &>/dev/null; then
       echo "[GENERAL FAIL] $experiment_name"
       # Print the logs to inspect
       ls -lah $logs
       any_fail=1
-    elif fgrep -A10000 'Metric Checks' $logs | fgrep FAIL &>/dev/null; then
+    elif grep -F -A10000 'Metric Checks' $logs | grep -F FAIL &>/dev/null; then
       echo "[METRIC FAIL] $experiment_name"
       # Print the metrics to inspect
-      fgrep -A10000 -H 'Metric Checks' $logs
+      grep -F -A10000 -H 'Metric Checks' $logs
       any_fail=1
     else
       echo "[METRIC PASS] $experiment_name"
       # Print the metrics to inspect
-      fgrep -A10000 -H 'Metric Checks' $logs
+      grep -F -A10000 -H 'Metric Checks' $logs
     fi
```

tools/launch-bisect.sh (1)
61-63: Address SC2155: Declare and assign separately to avoid masking return values.

If `git log` fails, the export will still succeed and mask the error. Separate declaration from assignment for safer error handling.

🔎 Suggested fix

```diff
-export EXTRA_ENV="${EXTRA_ENV:-} NRL_FORCE_REBUILD_VENVS=true NRL_MEGATRON_CHECKPOINT_DIR=$PROJECT_ROOT/code_snapshots_bisect/$(basename $TEST_CASE .sh)/mcore_ckpt_dir_$(git log -1 --format='%h-%f' HEAD)"
-# Use a different code snapshot directory name for each commit otherwise the same named test will run
-export CODE_SNAPSHOT_DIRNAME=code_snapshots_bisect/$(git log -1 --format='%h-%f' HEAD)
+# Use a different code snapshot directory name for each commit otherwise the same named test will run
+git_commit_slug=$(git log -1 --format='%h-%f' HEAD)
+export EXTRA_ENV="${EXTRA_ENV:-} NRL_FORCE_REBUILD_VENVS=true NRL_MEGATRON_CHECKPOINT_DIR=$PROJECT_ROOT/code_snapshots_bisect/$(basename $TEST_CASE .sh)/mcore_ckpt_dir_${git_commit_slug}"
+export CODE_SNAPSHOT_DIRNAME="code_snapshots_bisect/${git_commit_slug}"
```
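The hazard SC2155 flags is easy to demonstrate in isolation. This standalone sketch (not part of the PR) shows how combining `export` with a command substitution discards the substituted command's exit status, while a bare assignment preserves it:

```shell
#!/usr/bin/env bash
# SC2155 in miniature: `export VAR=$(cmd)` returns the exit status of
# `export` (0), not of `cmd`, so a failing command goes unnoticed.
export COMBINED=$(false)
echo "combined exit: $?"    # export succeeded, so the failure of `false` is masked

SEPARATE=$(false)           # a bare assignment propagates the exit status
echo "separate exit: $?"
export SEPARATE
```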
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- docs/testing.md (1 hunks)
- tests/check_metrics.py (3 hunks)
- tools/bisect-run.sh (1 hunks)
- tools/launch (4 hunks)
- tools/launch-bisect.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
tools/launch-bisect.sh
tools/bisect-run.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
tools/launch-bisect.sh
docs/testing.md
tools/bisect-run.sh
tests/check_metrics.py
tools/launch
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
tools/launch-bisect.sh
tools/bisect-run.sh
tests/check_metrics.py
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section
Files:
docs/testing.md
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
tests/check_metrics.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: For PRs with major changes (new features, breaking changes, or significant refactoring), verify that the PR description includes test results or testing information
📚 Learning: 2025-11-24T17:24:47.707Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: For PRs with major changes (new features, breaking changes, or significant refactoring), verify that the PR description includes test results or testing information
Applied to files:
docs/testing.md
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.sh : Use uv run instead of python to execute scripts
Applied to files:
tests/check_metrics.py
🪛 Shellcheck (0.11.0)
tools/launch-bisect.sh
[warning] 61-61: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 63-63: Declare and assign separately to avoid masking return values.
(SC2155)
tools/bisect-run.sh
[error] 218-218: Argument mixes string and array. Use * or separate argument.
(SC2145)
[error] 380-380: Argument mixes string and array. Use * or separate argument.
(SC2145)
🔇 Additional comments (13)
tests/check_metrics.py (2)
1-6: LGTM! Proper uv script header added.

The shebang with `uv run --script -q` and the inline dependencies block follow the coding guidelines for using `uv run` instead of `python` to execute scripts.
148-153: LGTM! Table width control is well-implemented.

The optional `--table-width` argument with `default=None` allows auto-sizing when unspecified while enabling explicit width control when needed. The `min_width=150` ensures reasonable table dimensions even without explicit width.

Also applies to: 181-181
tools/launch (3)
112-112: LGTM! Clean array-based argument handling.

The `JOB_IDS` array and `EXTRA_SCRIPT_ARGS` array provide cleaner argument aggregation and enable tracking submitted jobs for the new WATCH workflow. The metadata injection (wandb name, git metadata, container provenance) is well-structured.

Also applies to: 141-153
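As a hedged illustration of why arrays beat string concatenation for argument aggregation (the variable name mirrors the review; the flag names and values here are made up), arrays preserve word boundaries that a flat string would lose:

```shell
#!/usr/bin/env bash
# A space-containing value stays a single argument when stored in an
# array; concatenated into a string, it would word-split into three.
EXTRA_SCRIPT_ARGS=()
EXTRA_SCRIPT_ARGS+=(--wandb-name "bisect run 42")   # hypothetical value with spaces
EXTRA_SCRIPT_ARGS+=(--git-sha "abc1234")
printf '[%s]\n' "${EXTRA_SCRIPT_ARGS[@]}"           # one bracketed line per argument
```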
180-187: LGTM! Robust job ID extraction with error handling.

The regex-based job ID parsing from `sbatch` output and the error handling for parse failures ensure reliable job tracking.
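A minimal sketch of that extraction pattern, assuming sbatch's usual `Submitted batch job <id>` message (this is not the exact code from `tools/launch`):

```shell
#!/usr/bin/env bash
# Parse the job ID out of sbatch's submission message; fail loudly if the
# output doesn't match so a bad submission can't go untracked.
sbatch_output="Submitted batch job 4242"   # stand-in for "$(sbatch ...)"
if [[ "$sbatch_output" =~ Submitted\ batch\ job\ ([0-9]+) ]]; then
  job_id="${BASH_REMATCH[1]}"
  echo "parsed job id: $job_id"
else
  echo "ERROR: could not parse job id from: $sbatch_output" >&2
  exit 1
fi
```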
193-234: LGTM! Well-designed WATCH workflow with interactive polling.The WATCH implementation handles both interactive (TTY with early refresh on Enter) and non-interactive modes gracefully. The job tracking logic correctly identifies finished jobs and provides clear progress updates.
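The polling pattern described above can be sketched roughly like this (a simplified stand-in, not the actual `tools/launch` WATCH code; `is_finished` is a stub for a real scheduler query):

```shell
#!/usr/bin/env bash
# One polling pass: partition tracked jobs into finished and pending.
# A real loop would repeat this; on a TTY, `read -t <interval>` lets the
# user press Enter to refresh early instead of sleeping blindly:
#   if [[ -t 0 ]]; then read -r -t 60 || true; else sleep 60; fi
JOB_IDS=(101 102 103)                    # hypothetical tracked job IDs
is_finished() { [[ "$1" == "102" ]]; }   # stub for a real squeue/sacct check

still_pending=()
for job in "${JOB_IDS[@]}"; do
  if is_finished "$job"; then
    echo "job $job finished"
  else
    still_pending+=("$job")
  fi
done
echo "${#still_pending[@]} jobs still pending"
```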
docs/testing.md (2)
251-253: Verify MyST directive compatibility with your documentation system.

The `::::{note}` and `:::{important}` syntax are MyST Markdown directives. Ensure your documentation toolchain (Sphinx with myst-parser, MkDocs with appropriate plugins, etc.) supports these directives; otherwise they'll render as raw text.

Also applies to: 278-280
182-324: LGTM! Comprehensive bisect workflow documentation.

The documentation thoroughly covers the bisect workflow, including prerequisites (rsync tools/), usage patterns, environment variables, submodule handling, and recovery procedures. The tips for handling orphaned submodule commits and the `BISECT_REPLAY_LOG` resume mechanism are particularly valuable.

tools/launch-bisect.sh (2)
40-55: LGTM! Safe SED_CLAUSES handling with restoration trap.

The trap to restore the TEST_CASE from HEAD ensures modifications don't persist after the run. The warning for untracked files is appropriate.
70-83: LGTM! Essential submodule cleanup prevents bisect failures.

The recursive `git reset --hard && git clean -fdx` on submodules prevents the "Entry not uptodate. Cannot merge" errors documented in the comments. Preserving and returning the launcher's exit code ensures correct bisect classification.

tools/bisect-run.sh (4)
246-265: LGTM! Thorough submodule preparation.

Unshallowing submodules, fetching all branches, and fetching GitHub PR refs ensure any submodule commit can be checked out during bisect. The clean submodule check prevents unexpected failures mid-bisect.
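What the reset-and-clean guarantee buys is easy to show in a throwaway repo (a plain repo here rather than a real submodule, purely for illustration):

```shell
#!/usr/bin/env bash
set -euo pipefail
# Demonstrate what `git reset --hard && git clean -fdx` guarantees:
# tracked edits are reverted and untracked artifacts removed, leaving a
# pristine worktree for the next bisect checkout.
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
echo original > tracked.txt
git add tracked.txt && git commit -qm "add tracked file"
echo edited > tracked.txt       # dirty the worktree
echo junk > untracked.txt       # simulate a leftover build artifact
git reset --hard -q
git clean -fdx -q               # -x also removes ignored files
echo "tracked: $(cat tracked.txt)"
if [[ -e untracked.txt ]]; then echo "untracked remains"; else echo "untracked removed"; fi
```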
267-300: LGTM! Robust git config validation for submodule recursion.

The validation of `submodule.recurse` and `fetch.recurseSubmodules` with helpful error messages ensures the bisect will work correctly across commits that change submodule references.
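A hedged sketch of that validation shape (the config keys come from the review; the helper name and messages are illustrative, not the script's actual code):

```shell
#!/usr/bin/env bash
# Check that a git config key has the expected effective value, and print
# a fix-it hint otherwise. Returns non-zero on mismatch.
check_git_cfg() {
  local key="$1" want="$2" have
  have=$(git config --get "$key" || true)   # empty if unset
  if [[ "$have" != "$want" ]]; then
    echo "config '$key' is '${have:-unset}', expected '$want'" >&2
    echo "hint: git config $key $want" >&2
    return 1
  fi
}
check_git_cfg submodule.recurse true || true
check_git_cfg fetch.recurseSubmodules true || true
```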
319-361: LGTM! Pre-verification of GOOD commit is a valuable safety check.

Verifying the GOOD commit actually passes before starting the bisect prevents wasted time on misconfigured baselines. The `SKIP_GOOD_CHECK` escape hatch is useful for resuming interrupted sessions.
180-207: LGTM! Well-designed log saving and cleanup handlers.

The `save_bisect_log` function with timestamped filenames and the interrupt handler that provides resume instructions create a robust recovery mechanism. The `BISECT_NO_RESET` option for debugging is a nice touch.

Also applies to: 209-238
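As a rough sketch of the timestamped-log idea (the function and file names are assumed, and a placeholder stands in for `git bisect log` so this runs outside a bisect session):

```shell
#!/usr/bin/env bash
# Save a uniquely named snapshot of the bisect log so an interrupted
# session can later be replayed with `git bisect replay <file>`.
save_bisect_log() {
  local out
  out="bisect-log-$(date +%Y%m%d-%H%M%S).txt"   # declare/assign split per SC2155
  echo "# placeholder for: git bisect log" > "$out"
  echo "saved $out"
}
# An interrupt handler can snapshot the log before exiting:
trap 'save_bisect_log; exit 130' INT
save_bisect_log
```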
Force-pushed 94781a7 to 4e71a29.
jgerh left a comment

Completed the tech pubs review of testing.md and provided a few copyedits.
Force-pushed db51920 to 0128d0b.
Force-pushed 0128d0b to 5e4a96c.
Follow-up to #1215 to enable bisecting nightly tests.
Example 1: SFT failures

Here's an example invocation that found one regression with `tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh`: https://wandb.ai/nvidia/nemo-rl?nw=fnhia71y43d
Which produces this `git bisect log`; the failure turned out to be a regression because the metric check was too noisy. In particular, `'data["train/loss"]["250"] < 0.5'` failed once we started using `num_workers=1`, which changed the determinism of the run.

Example 2: FP8 bisect

Found that even the initial commit failed.
Summary by CodeRabbit

Release Notes
- New Features
- Documentation
- Chores