
cp: feat: Add Nemotron‑3 Nano 30B A3B BF16 SFT nightly tests (FSDP2, +LoRA) (1648) into r0.5.0 (#1697)

Merged
yuki-97 merged 1 commit into r0.5.0 from cherry-pick-1648-r0.5.0 on Dec 24, 2025

Conversation

@chtruong814 (Contributor) commented Dec 24, 2025

beep boop [🤖]: Hi @RayenTian 👋,

we've cherry-picked #1648 into r0.5.0 for you! 🚀

Please review and approve this cherry pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added new supervised fine-tuning (SFT) configurations for the NanoV3 model, including standard and LoRA-enabled variants.
  • Tests

    • Introduced new test scripts for validating NanoV3 configurations in the nightly test suite.
    • Updated compute resource threshold validation to accommodate expanded test suite requirements.


…A) (#1648)

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@chtruong814 requested a review from a team as a code owner December 24, 2025 06:55

coderabbitai bot (Contributor) commented Dec 24, 2025

📝 Walkthrough

This change adds two new SFT experiment configurations for the Nemotron-3 Nano 30B model—one standard and one with LoRA fine-tuning enabled. Corresponding test scripts are introduced and registered in the nightly test suite. A nightly compute threshold test is updated to reflect new GPU hour limits.

Changes

Cohort / File(s) — Summary
SFT Configuration Files
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml, examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
New YAML configs for Nemotron Nano 30B SFT experiments with FSDP2 on 2 nodes (8 GPUs each). Second variant enables LoRA with dim=256, alpha=512. Both set max_num_steps=100, train_global_batch_size=16, max_total_sequence_length=2048.
Test Scripts
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
New shell scripts that execute SFT experiments via uv run, convert TensorBoard logs to JSON, and conditionally validate loss and timing metrics if the maximum step count is reached.
Nightly Test Suite Registration
tests/test_suites/nightly.txt
Registers two new test scripts in nightly suite (appears in multiple sections).
Test Threshold Update
tests/unit/test_recipes_and_test_suites.py
Renamed test function and updated GPU hour threshold from 1130 to 1140 hours. Refactored implementation to execute nightly suite via subprocess, capture output, parse GPU hours metric, and enforce success.
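The refactored threshold test described above (run the nightly suite in a subprocess, capture its output, parse the GPU hours metric, and enforce the budget) can be sketched roughly as follows. This is an illustration only: the output format `Total GPU hours: <N>`, the function names, and the command invocation are assumptions, not the repository's actual interface.

```python
import re
import subprocess

G_MAX_GPU_HOURS = 1140  # threshold raised from 1130 in this PR


def parse_gpu_hours(output: str) -> float:
    """Extract a 'Total GPU hours: <N>' metric from suite output (format assumed)."""
    match = re.search(r"Total GPU hours:\s*([\d.]+)", output)
    if match is None:
        raise ValueError("GPU hours metric not found in suite output")
    return float(match.group(1))


def check_nightly_budget(cmd: list[str]) -> float:
    """Run the suite driver, capture stdout, and enforce the GPU-hour budget."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    hours = parse_gpu_hours(result.stdout)
    assert hours <= G_MAX_GPU_HOURS, f"nightly suite needs {hours} GPU hours"
    return hours
```

Parsing a single well-known metric line with a regex keeps the test robust to other log noise in the subprocess output.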

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

CI:L1, r0.5.0

Suggested reviewers

  • joyang-nv
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 50.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly describes the main change: adding Nemotron-3 Nano 30B A3B BF16 SFT nightly tests with FSDP2 and LoRA configurations, which aligns with the changeset.
  • Test Results For Major Changes — ✅ Passed: the PR adds minor test configurations for an existing model variant with defined performance validation thresholds; the modest GPU hour threshold increase is proportional to the test additions.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd3b423 and 4fb842a.

📒 Files selected for processing (6)
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/nightly.txt
  • tests/unit/test_recipes_and_test_suites.py
🧰 Additional context used
📓 Path-based instructions (8)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/unit/test_recipes_and_test_suites.py
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/unit/test_recipes_and_test_suites.py
tests/test_suites/nightly.txt

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • tests/unit/test_recipes_and_test_suites.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-09-24T18:36:06.287Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.yaml:1-1
Timestamp: 2025-09-24T18:36:06.287Z
Learning: In the NVIDIA NeMo RL repository, when working with Hydra config defaults, the scalar string format (defaults: ../../dpo.yaml) is acceptable and preferred over the list format, even though Hydra typically expects defaults to be a list.

Applied to files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
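The NUM_RUNS expression in the learning above, `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`, is ordinary ceiling division. A minimal Python sketch of the same arithmetic (the actual scripts compute it in shell):

```python
def num_runs(max_steps: int, steps_per_run: int) -> int:
    """Ceiling division: how many runs are needed to cover max_steps."""
    return (max_steps + steps_per_run - 1) // steps_per_run
```

For example, 100 max steps at 30 steps per run requires 4 runs, since the last partial run still counts.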
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/unit/test_recipes_and_test_suites.py
🧬 Code graph analysis (1)
tests/unit/test_recipes_and_test_suites.py (1)
tests/unit/conftest.py (1)
  • tracker (265-296)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 27-27: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 27-27: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (6)
tests/test_suites/nightly.txt (1)

93-96: LGTM! Nightly test entries properly added.

The new test scripts for Nemotron 3 Nano 30B A3B BF16 (standard and LoRA variants) are correctly registered following the established pattern. As per coding guidelines and learnings, this is the proper way to add nightly tests for a new model.

examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml (1)

1-20: LGTM! Configuration follows established patterns.

The SFT configuration for Nemotron 3 Nano 30B A3B BF16 is well-structured:

  • Filename follows the naming pattern specified in coding guidelines
  • Uses scalar string format for defaults (acceptable per learnings)
  • Appropriate settings for a 30B model test (100 steps, batch size 16, sequence length 2048)
  • Logging and cluster configuration properly specified
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml (1)

1-26: LGTM! LoRA configuration properly extends base config.

The LoRA-enabled variant correctly adds the LoRA configuration block while maintaining consistency with the base config. The LoRA hyperparameters (dim: 256, alpha: 512) are reasonable, and logger names appropriately include the "-lora" suffix.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)

1-38: LGTM! Test script follows established patterns.

The test script correctly implements the standard test infrastructure pattern:

  • Configuration variables (NUM_NODES, NUM_RUNS, NUM_MINUTES) are properly set for external launch tooling consumption
  • Uses uv run as specified in coding guidelines
  • Appropriate metrics validation with loss threshold < 1.98 and timing threshold < 15 seconds
  • Static analysis warnings (SC2034, SC2164, SC2068) are expected false positives per established repository patterns

Based on learnings, test scripts follow this standard pattern and the warnings can be safely ignored.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)

1-38: LGTM! LoRA test script properly configured with appropriate thresholds.

The LoRA variant follows the same pattern as the base test script with appropriately relaxed thresholds:

  • Loss threshold: < 2.03 (vs. 1.98 for non-LoRA)
  • Timing threshold: < 18 seconds (vs. 15 for non-LoRA)

The more generous timing buffer is appropriate given LoRA's potentially different performance characteristics. Static analysis warnings are expected false positives per established repository patterns.

Based on learnings, the test infrastructure variables and patterns are correct.

tests/unit/test_recipes_and_test_suites.py (1)

183-218: LGTM! GPU hour threshold increase is justified.

The test function rename and threshold update from 1130 to 1140 GPU hours is appropriate given the addition of two new Nemotron 3 Nano tests. Expected GPU hour consumption:

  • 2 new tests × (2 nodes × 8 GPUs × 0.25 hours) ≈ 8 GPU hours
  • Threshold increase of 10 hours provides reasonable buffer

The refactored test implementation with subprocess execution and explicit result handling improves robustness.
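The back-of-envelope arithmetic in that review comment can be checked directly. Note the ~0.25 hours of wall-clock time per run is the review's assumption, not a measured value:

```python
new_tests = 2
nodes, gpus_per_node, hours_per_run = 2, 8, 0.25  # assumed wall-clock per run

# tests x nodes x GPUs x hours ~= 8 GPU hours added by this PR
added_gpu_hours = new_tests * nodes * gpus_per_node * hours_per_run
assert added_gpu_hours == 8.0

# The threshold bump (1130 -> 1140) leaves a 2 GPU-hour buffer on top.
assert (1140 - 1130) - added_gpu_hours == 2.0
```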




@RayenTian (Contributor) left a comment


Perfect!

@yuki-97 enabled auto-merge (squash) December 24, 2025 08:43
@yuki-97 added the CI:L1 label (Run doctests, unit tests, and functional tests) Dec 24, 2025
@yuki-97 merged commit e883ac4 into r0.5.0 Dec 24, 2025
79 of 88 checks passed
@yuki-97 deleted the cherry-pick-1648-r0.5.0 branch December 24, 2025 14:43
avenkateshha pushed a commit to avenkateshha/RL that referenced this pull request Apr 10, 2026
… +LoRA) (1648)` into `r0.5.0` (NVIDIA-NeMo#1697)

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Rayen <ruit@nvidia.com>

Labels

cherry-pick · CI:L1 (Run doctests, unit tests, and functional tests) · Run CICD

3 participants