
cp: feat: Add Nemotron‑3 Nano 30B A3B BF16 SFT nightly tests (FSDP2, +LoRA) (1648) into r0.5.0 (#1697)

Merged
yuki-97 merged 1 commit into r0.5.0 from cherry-pick-1648-r0.5.0 on Dec 24, 2025

Conversation

@chtruong814 (Contributor) commented Dec 24, 2025

beep boop [🤖]: Hi @RayenTian 👋,

we've cherry-picked #1648 into r0.5.0 for you! 🚀

Please review and approve this cherry pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added new supervised fine-tuning (SFT) configurations for the NanoV3 model, including standard and LoRA-enabled variants.
  • Tests

    • Introduced new test scripts for validating NanoV3 configurations in the nightly test suite.
    • Updated compute resource threshold validation to accommodate expanded test suite requirements.


…A) (#1648)

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@chtruong814 requested a review from a team as a code owner December 24, 2025 06:55

coderabbitai bot (Contributor) commented Dec 24, 2025

📝 Walkthrough

This change adds two new SFT experiment configurations for the Nemotron-3 Nano 30B model—one standard and one with LoRA fine-tuning enabled. Corresponding test scripts are introduced and registered in the nightly test suite. A nightly compute threshold test is updated to reflect new GPU hour limits.

Changes

Cohort / File(s) — Summary
SFT Configuration Files
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml, examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
New YAML configs for Nemotron Nano 30B SFT experiments with FSDP2 on 2 nodes (8 GPUs each). Second variant enables LoRA with dim=256, alpha=512. Both set max_num_steps=100, train_global_batch_size=16, max_total_sequence_length=2048.
Test Scripts
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
New shell scripts that execute SFT experiments via uv run, convert TensorBoard logs to JSON, and conditionally validate loss and timing metrics if the maximum step count is reached.
Nightly Test Suite Registration
tests/test_suites/nightly.txt
Registers two new test scripts in nightly suite (appears in multiple sections).
Test Threshold Update
tests/unit/test_recipes_and_test_suites.py
Renamed test function and updated GPU hour threshold from 1130 to 1140 hours. Refactored implementation to execute nightly suite via subprocess, capture output, parse GPU hours metric, and enforce success.
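The refactored threshold test described above (run the nightly suite in a subprocess, capture its output, parse the GPU hours metric, and enforce the budget) can be sketched roughly as follows. This is an illustration only: the output format `Total GPU hours: <N>`, the function names, and the command invocation are assumptions, not the repository's actual interface.

```python
import re
import subprocess

G_MAX_GPU_HOURS = 1140  # threshold raised from 1130 in this PR


def parse_gpu_hours(output: str) -> float:
    """Extract a 'Total GPU hours: <N>' metric from suite output (format assumed)."""
    match = re.search(r"Total GPU hours:\s*([\d.]+)", output)
    if match is None:
        raise ValueError("GPU hours metric not found in suite output")
    return float(match.group(1))


def check_nightly_budget(cmd: list[str]) -> float:
    """Run the suite driver, capture stdout, and enforce the GPU-hour budget."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    hours = parse_gpu_hours(result.stdout)
    assert hours <= G_MAX_GPU_HOURS, f"nightly suite needs {hours} GPU hours"
    return hours
```

Parsing a single well-known metric line with a regex keeps the test robust to other log noise in the subprocess output.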

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

CI:L1, r0.5.0

Suggested reviewers

  • joyang-nv
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 50.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly describes the main change: adding Nemotron-3 Nano 30B A3B BF16 SFT nightly tests with FSDP2 and LoRA configurations, which aligns with the changeset.
  • Test Results For Major Changes — ✅ Passed: the PR adds minor test configurations for an existing model variant with defined performance validation thresholds; the modest GPU hour threshold increase is proportional to the test additions.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd3b423 and 4fb842a.

📒 Files selected for processing (6)
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/nightly.txt
  • tests/unit/test_recipes_and_test_suites.py
🧰 Additional context used
📓 Path-based instructions (8)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/unit/test_recipes_and_test_suites.py
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/unit/test_recipes_and_test_suites.py
tests/test_suites/nightly.txt

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • tests/unit/test_recipes_and_test_suites.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-09-24T18:36:06.287Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.yaml:1-1
Timestamp: 2025-09-24T18:36:06.287Z
Learning: In the NVIDIA NeMo RL repository, when working with Hydra config defaults, the scalar string format (defaults: ../../dpo.yaml) is acceptable and preferred over the list format, even though Hydra typically expects defaults to be a list.

Applied to files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
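The NUM_RUNS expression in the learning above, `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`, is ordinary ceiling division. A minimal Python sketch of the same arithmetic (the actual scripts compute it in shell):

```python
def num_runs(max_steps: int, steps_per_run: int) -> int:
    """Ceiling division: how many runs are needed to cover max_steps."""
    return (max_steps + steps_per_run - 1) // steps_per_run
```

For example, 100 max steps at 30 steps per run requires 4 runs, since the last partial run still counts.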
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/unit/test_recipes_and_test_suites.py
🧬 Code graph analysis (1)
tests/unit/test_recipes_and_test_suites.py (1)
tests/unit/conftest.py (1)
  • tracker (265-296)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 27-27: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 27-27: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (6)
tests/test_suites/nightly.txt (1)

93-96: LGTM! Nightly test entries properly added.

The new test scripts for Nemotron 3 Nano 30B A3B BF16 (standard and LoRA variants) are correctly registered following the established pattern. As per coding guidelines and learnings, this is the proper way to add nightly tests for a new model.

examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml (1)

1-20: LGTM! Configuration follows established patterns.

The SFT configuration for Nemotron 3 Nano 30B A3B BF16 is well-structured:

  • Filename follows the naming pattern specified in coding guidelines
  • Uses scalar string format for defaults (acceptable per learnings)
  • Appropriate settings for a 30B model test (100 steps, batch size 16, sequence length 2048)
  • Logging and cluster configuration properly specified
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml (1)

1-26: LGTM! LoRA configuration properly extends base config.

The LoRA-enabled variant correctly adds the LoRA configuration block while maintaining consistency with the base config. The LoRA hyperparameters (dim: 256, alpha: 512) are reasonable, and logger names appropriately include the "-lora" suffix.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)

1-38: LGTM! Test script follows established patterns.

The test script correctly implements the standard test infrastructure pattern:

  • Configuration variables (NUM_NODES, NUM_RUNS, NUM_MINUTES) are properly set for external launch tooling consumption
  • Uses uv run as specified in coding guidelines
  • Appropriate metrics validation with loss threshold < 1.98 and timing threshold < 15 seconds
  • Static analysis warnings (SC2034, SC2164, SC2068) are expected false positives per established repository patterns

Based on learnings, test scripts follow this standard pattern and the warnings can be safely ignored.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)

1-38: LGTM! LoRA test script properly configured with appropriate thresholds.

The LoRA variant follows the same pattern as the base test script with appropriately relaxed thresholds:

  • Loss threshold: < 2.03 (vs. 1.98 for non-LoRA)
  • Timing threshold: < 18 seconds (vs. 15 for non-LoRA)

The more generous timing buffer is appropriate given LoRA's potentially different performance characteristics. Static analysis warnings are expected false positives per established repository patterns.

Based on learnings, the test infrastructure variables and patterns are correct.

tests/unit/test_recipes_and_test_suites.py (1)

183-218: LGTM! GPU hour threshold increase is justified.

The test function rename and threshold update from 1130 to 1140 GPU hours is appropriate given the addition of two new Nemotron 3 Nano tests. Expected GPU hour consumption:

  • 2 new tests × (2 nodes × 8 GPUs × 0.25 hours) ≈ 8 GPU hours
  • Threshold increase of 10 hours provides reasonable buffer

The refactored test implementation with subprocess execution and explicit result handling improves robustness.
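The back-of-envelope arithmetic in that review comment can be checked directly. Note the ~0.25 hours of wall-clock time per run is the review's assumption, not a measured value:

```python
new_tests = 2
nodes, gpus_per_node, hours_per_run = 2, 8, 0.25  # assumed wall-clock per run

# tests x nodes x GPUs x hours ~= 8 GPU hours added by this PR
added_gpu_hours = new_tests * nodes * gpus_per_node * hours_per_run
assert added_gpu_hours == 8.0

# The threshold bump (1130 -> 1140) leaves a 2 GPU-hour buffer on top.
assert (1140 - 1130) - added_gpu_hours == 2.0
```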




@RayenTian (Contributor) left a comment


Perfect!

@yuki-97 enabled auto-merge (squash) December 24, 2025 08:43
@yuki-97 added the CI:L1 label (Run doctests, unit tests, and functional tests) Dec 24, 2025
@yuki-97 merged commit e883ac4 into r0.5.0 Dec 24, 2025
79 of 88 checks passed
@yuki-97 deleted the cherry-pick-1648-r0.5.0 branch December 24, 2025 14:43
avenkateshha pushed a commit to avenkateshha/RL that referenced this pull request Apr 10, 2026
… +LoRA) (1648)` into `r0.5.0` (NVIDIA-NeMo#1697)

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Rayen <ruit@nvidia.com>

Labels

cherry-pick · CI:L1 (Run doctests, unit tests, and functional tests) · Run CICD

3 participants