feat: Add bisecting tooling for nightly test regressions (#1223)
Conversation

Force-pushed 89a0b31 to 6c79364.

@terrykong: "@CodeRabbit title"
Tip: For best results, initiate chat on the files or code changes.

@terrykong Based on the PR content, I suggest the following title:

feat: Add bisecting tooling for nightly test regressions

Or alternatively:

feat: Enable git bisect automation for nightly test failures

Both titles clearly convey the PR's intent. The first option is more concise, while the second is more specific about the git bisect automation aspect. Let me know if you'd like me to adjust the wording!
Force-pushed 418b671 to 94781a7.
📝 Walkthrough

The PR introduces comprehensive git bisect testing infrastructure: documentation on bisecting failing tests, a new Bash script for orchestrating bisect workflows with submodule handling, a metrics table width control parameter, and enhanced launcher capabilities with job tracking and post-run workflow analysis.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (2)
tools/launch (1)
265-279: Consider replacing deprecated `fgrep` with `grep -F`.

`fgrep` is deprecated in favor of `grep -F` per POSIX recommendations. While it still works, updating would future-proof the script.

🔎 Suggested diff

```diff
-    if ! fgrep -A10000 'Metric Checks' $logs &>/dev/null; then
+    if ! grep -F -A10000 'Metric Checks' $logs &>/dev/null; then
       echo "[GENERAL FAIL] $experiment_name"
       # Print the logs to inspect
       ls -lah $logs
       any_fail=1
-    elif fgrep -A10000 'Metric Checks' $logs | fgrep FAIL &>/dev/null; then
+    elif grep -F -A10000 'Metric Checks' $logs | grep -F FAIL &>/dev/null; then
       echo "[METRIC FAIL] $experiment_name"
       # Print the metrics to inspect
-      fgrep -A10000 -H 'Metric Checks' $logs
+      grep -F -A10000 -H 'Metric Checks' $logs
       any_fail=1
     else
       echo "[METRIC PASS] $experiment_name"
       # Print the metrics to inspect
-      fgrep -A10000 -H 'Metric Checks' $logs
+      grep -F -A10000 -H 'Metric Checks' $logs
     fi
```

tools/launch-bisect.sh (1)
61-63: Address SC2155: Declare and assign separately to avoid masking return values.

If `git log` fails, the export will still succeed and mask the error. Separate declaration from assignment for safer error handling.

🔎 Suggested fix

```diff
-export EXTRA_ENV="${EXTRA_ENV:-} NRL_FORCE_REBUILD_VENVS=true NRL_MEGATRON_CHECKPOINT_DIR=$PROJECT_ROOT/code_snapshots_bisect/$(basename $TEST_CASE .sh)/mcore_ckpt_dir_$(git log -1 --format='%h-%f' HEAD)"
-# Use a different code snapshot directory name for each commit otherwise the same named test will run
-export CODE_SNAPSHOT_DIRNAME=code_snapshots_bisect/$(git log -1 --format='%h-%f' HEAD)
+# Use a different code snapshot directory name for each commit otherwise the same named test will run
+git_commit_slug=$(git log -1 --format='%h-%f' HEAD)
+export EXTRA_ENV="${EXTRA_ENV:-} NRL_FORCE_REBUILD_VENVS=true NRL_MEGATRON_CHECKPOINT_DIR=$PROJECT_ROOT/code_snapshots_bisect/$(basename $TEST_CASE .sh)/mcore_ckpt_dir_${git_commit_slug}"
+export CODE_SNAPSHOT_DIRNAME="code_snapshots_bisect/${git_commit_slug}"
```
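The hazard SC2155 flags is easy to demonstrate in isolation. This standalone sketch (not part of the PR) shows how combining `export` with a command substitution discards the substituted command's exit status, while a bare assignment preserves it:

```shell
#!/usr/bin/env bash
# SC2155 in miniature: `export VAR=$(cmd)` returns the exit status of
# `export` (0), not of `cmd`, so a failing command goes unnoticed.
export COMBINED=$(false)
echo "combined exit: $?"    # export succeeded, so the failure of `false` is masked

SEPARATE=$(false)           # a bare assignment propagates the exit status
echo "separate exit: $?"
export SEPARATE
```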
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- docs/testing.md (1 hunks)
- tests/check_metrics.py (3 hunks)
- tools/bisect-run.sh (1 hunks)
- tools/launch (4 hunks)
- tools/launch-bisect.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
tools/launch-bisect.sh
tools/bisect-run.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
tools/launch-bisect.sh
docs/testing.md
tools/bisect-run.sh
tests/check_metrics.py
tools/launch
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
tools/launch-bisect.sh
tools/bisect-run.sh
tests/check_metrics.py
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section
Files:
docs/testing.md
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
tests/check_metrics.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: For PRs with major changes (new features, breaking changes, or significant refactoring), verify that the PR description includes test results or testing information
📚 Learning: 2025-11-24T17:24:47.707Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2025-11-24T17:24:47.707Z
Learning: For PRs with major changes (new features, breaking changes, or significant refactoring), verify that the PR description includes test results or testing information
Applied to files:
docs/testing.md
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to **/*.sh : Use uv run instead of python to execute scripts
Applied to files:
tests/check_metrics.py
🪛 Shellcheck (0.11.0)
tools/launch-bisect.sh
[warning] 61-61: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 63-63: Declare and assign separately to avoid masking return values.
(SC2155)
tools/bisect-run.sh
[error] 218-218: Argument mixes string and array. Use * or separate argument.
(SC2145)
[error] 380-380: Argument mixes string and array. Use * or separate argument.
(SC2145)
🔇 Additional comments (13)
tests/check_metrics.py (2)
1-6: LGTM! Proper uv script header added.

The shebang with `uv run --script -q` and the inline dependencies block follow the coding guidelines for using `uv run` instead of `python` to execute scripts.
148-153: LGTM! Table width control is well-implemented.

The optional `--table-width` argument with `default=None` allows auto-sizing when unspecified while enabling explicit width control when needed. The `min_width=150` ensures reasonable table dimensions even without explicit width.

Also applies to: 181-181
tools/launch (3)
112-112: LGTM! Clean array-based argument handling.

The `JOB_IDS` array and `EXTRA_SCRIPT_ARGS` array provide cleaner argument aggregation and enable tracking submitted jobs for the new WATCH workflow. The metadata injection (wandb name, git metadata, container provenance) is well-structured.

Also applies to: 141-153
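As a hedged illustration of why arrays beat string concatenation for argument aggregation (the variable name mirrors the review; the flag names and values here are made up), arrays preserve word boundaries that a flat string would lose:

```shell
#!/usr/bin/env bash
# A space-containing value stays a single argument when stored in an
# array; concatenated into a string, it would word-split into three.
EXTRA_SCRIPT_ARGS=()
EXTRA_SCRIPT_ARGS+=(--wandb-name "bisect run 42")   # hypothetical value with spaces
EXTRA_SCRIPT_ARGS+=(--git-sha "abc1234")
printf '[%s]\n' "${EXTRA_SCRIPT_ARGS[@]}"           # one bracketed line per argument
```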
180-187: LGTM! Robust job ID extraction with error handling.

The regex-based job ID parsing from `sbatch` output and the error handling for parse failures ensure reliable job tracking.
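A minimal sketch of that extraction pattern, assuming sbatch's usual `Submitted batch job <id>` message (this is not the exact code from `tools/launch`):

```shell
#!/usr/bin/env bash
# Parse the job ID out of sbatch's submission message; fail loudly if the
# output doesn't match so a bad submission can't go untracked.
sbatch_output="Submitted batch job 4242"   # stand-in for "$(sbatch ...)"
if [[ "$sbatch_output" =~ Submitted\ batch\ job\ ([0-9]+) ]]; then
  job_id="${BASH_REMATCH[1]}"
  echo "parsed job id: $job_id"
else
  echo "ERROR: could not parse job id from: $sbatch_output" >&2
  exit 1
fi
```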
193-234: LGTM! Well-designed WATCH workflow with interactive polling.The WATCH implementation handles both interactive (TTY with early refresh on Enter) and non-interactive modes gracefully. The job tracking logic correctly identifies finished jobs and provides clear progress updates.
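The polling pattern described above can be sketched roughly like this (a simplified stand-in, not the actual `tools/launch` WATCH code; `is_finished` is a stub for a real scheduler query):

```shell
#!/usr/bin/env bash
# One polling pass: partition tracked jobs into finished and pending.
# A real loop would repeat this; on a TTY, `read -t <interval>` lets the
# user press Enter to refresh early instead of sleeping blindly:
#   if [[ -t 0 ]]; then read -r -t 60 || true; else sleep 60; fi
JOB_IDS=(101 102 103)                    # hypothetical tracked job IDs
is_finished() { [[ "$1" == "102" ]]; }   # stub for a real squeue/sacct check

still_pending=()
for job in "${JOB_IDS[@]}"; do
  if is_finished "$job"; then
    echo "job $job finished"
  else
    still_pending+=("$job")
  fi
done
echo "${#still_pending[@]} jobs still pending"
```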
docs/testing.md (2)
251-253: Verify MyST directive compatibility with your documentation system.

The `::::{note}` and `:::{important}` syntax are MyST Markdown directives. Ensure your documentation toolchain (Sphinx with myst-parser, MkDocs with appropriate plugins, etc.) supports these directives; otherwise they'll render as raw text.

Also applies to: 278-280
182-324: LGTM! Comprehensive bisect workflow documentation.

The documentation thoroughly covers the bisect workflow, including prerequisites (rsync tools/), usage patterns, environment variables, submodule handling, and recovery procedures. The tips for handling orphaned submodule commits and the `BISECT_REPLAY_LOG` resume mechanism are particularly valuable.

tools/launch-bisect.sh (2)
40-55: LGTM! Safe SED_CLAUSES handling with restoration trap.

The trap to restore the TEST_CASE from HEAD ensures modifications don't persist after the run. The warning for untracked files is appropriate.
70-83: LGTM! Essential submodule cleanup prevents bisect failures.

The recursive `git reset --hard && git clean -fdx` on submodules prevents the "Entry not uptodate. Cannot merge" errors documented in the comments. Preserving and returning the launcher's exit code ensures correct bisect classification.

tools/bisect-run.sh (4)
246-265: LGTM! Thorough submodule preparation.

Unshallowing submodules, fetching all branches, and fetching GitHub PR refs ensure any submodule commit can be checked out during bisect. The clean submodule check prevents unexpected failures mid-bisect.
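What the reset-and-clean guarantee buys is easy to show in a throwaway repo (a plain repo here rather than a real submodule, purely for illustration):

```shell
#!/usr/bin/env bash
set -euo pipefail
# Demonstrate what `git reset --hard && git clean -fdx` guarantees:
# tracked edits are reverted and untracked artifacts removed, leaving a
# pristine worktree for the next bisect checkout.
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo
echo original > tracked.txt
git add tracked.txt && git commit -qm "add tracked file"
echo edited > tracked.txt       # dirty the worktree
echo junk > untracked.txt       # simulate a leftover build artifact
git reset --hard -q
git clean -fdx -q               # -x also removes ignored files
echo "tracked: $(cat tracked.txt)"
if [[ -e untracked.txt ]]; then echo "untracked remains"; else echo "untracked removed"; fi
```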
267-300: LGTM! Robust git config validation for submodule recursion.

The validation of `submodule.recurse` and `fetch.recurseSubmodules` with helpful error messages ensures the bisect will work correctly across commits that change submodule references.
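A hedged sketch of that validation shape (the config keys come from the review; the helper name and messages are illustrative, not the script's actual code):

```shell
#!/usr/bin/env bash
# Check that a git config key has the expected effective value, and print
# a fix-it hint otherwise. Returns non-zero on mismatch.
check_git_cfg() {
  local key="$1" want="$2" have
  have=$(git config --get "$key" || true)   # empty if unset
  if [[ "$have" != "$want" ]]; then
    echo "config '$key' is '${have:-unset}', expected '$want'" >&2
    echo "hint: git config $key $want" >&2
    return 1
  fi
}
check_git_cfg submodule.recurse true || true
check_git_cfg fetch.recurseSubmodules true || true
```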
319-361: LGTM! Pre-verification of GOOD commit is a valuable safety check.

Verifying the GOOD commit actually passes before starting the bisect prevents wasted time on misconfigured baselines. The `SKIP_GOOD_CHECK` escape hatch is useful for resuming interrupted sessions.
180-207: LGTM! Well-designed log saving and cleanup handlers.

The `save_bisect_log` function with timestamped filenames and the interrupt handler that provides resume instructions create a robust recovery mechanism. The `BISECT_NO_RESET` option for debugging is a nice touch.

Also applies to: 209-238
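As a rough sketch of the timestamped-log idea (the function and file names are assumed, and a placeholder stands in for `git bisect log` so this runs outside a bisect session):

```shell
#!/usr/bin/env bash
# Save a uniquely named snapshot of the bisect log so an interrupted
# session can later be replayed with `git bisect replay <file>`.
save_bisect_log() {
  local out
  out="bisect-log-$(date +%Y%m%d-%H%M%S).txt"   # declare/assign split per SC2155
  echo "# placeholder for: git bisect log" > "$out"
  echo "saved $out"
}
# An interrupt handler can snapshot the log before exiting:
trap 'save_bisect_log; exit 130' INT
save_bisect_log
```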
Force-pushed 94781a7 to 4e71a29.
jgerh left a comment

Completed the tech pubs review of testing.md and provided a few copyedits.
Force-pushed db51920 to 0128d0b.
Force-pushed 0128d0b to 5e4a96c.
Follow-up to #1215 to enable bisecting nightly tests.
Example 1: SFT failures

Here's an example invocation that found one regression with `tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh`: https://wandb.ai/nvidia/nemo-rl?nw=fnhia71y43d
Which produces this `git bisect log`; the failure turned out to be a regression because the metric check was too noisy. In particular, `'data["train/loss"]["250"] < 0.5'` failed once we started using `num_workers=1`, which changed the determinism of the run.

Example 2: FP8 bisect

Found that even the initial commit failed.
Summary by CodeRabbit

Release Notes
- New Features
- Documentation
- Chores