
feat: Fp8 moe rollout#1446

Merged
terrykong merged 24 commits into NVIDIA-NeMo:main from guyueh1:fp8_moe_rollout
Nov 27, 2025

Conversation

@guyueh1
Contributor

@guyueh1 guyueh1 commented Oct 29, 2025

What does this PR do ?

Support fp8 precision in generation phase for MoE models.

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
COMMAND="uv run bash tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.sh" \
CONTAINER=<YOUR CONTAINER IMAGE> \
sbatch --account <SLURM ACCOUNT> --nodes 1 --partition <SLURM PARTITION> ray.sub

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added new GRPO configuration with FP8 quantization support for optimized performance.
    • Generation KL Error metric now displayed during training results.
    • Extended FP8 quantization support to MoE (Mixture of Experts) models.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners October 29, 2025 18:03
guyueh1 and others added 6 commits October 31, 2025 09:49
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented Nov 2, 2025

📝 Walkthrough


This pull request extends FP8 quantization support to MoE (Mixture of Experts) models within the generation pipeline. Changes include a new GRPO configuration file enabling FP8 inference with Megatron parallelism, enhancements to FP8 weight processing for FusedMoE modules, modified module traversal logic, and an additional training metric display.

Changes

Cohort / File(s) Summary
Configuration for FP8-enabled GRPO
examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
New configuration file setting up GRPO training with FP8 quantization, Megatron parallelism (tensor parallel size 1, pipeline parallel size 2), and vLLM inference with FP8 precision and deep GEMM support.
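Based on the summary above, the key settings might look like the following illustrative YAML fragment. The key names and nesting here are assumptions inferred from the description, not the actual contents of grpo_math_qwen30ba3b_megatron_fp8.yaml:

```yaml
# Illustrative sketch only -- key names and structure are assumed,
# not copied from the real config file.
policy:
  megatron_cfg:
    tensor_model_parallel_size: 1
    pipeline_model_parallel_size: 2
  generation:
    backend: vllm
    vllm_cfg:
      precision: fp8        # FP8 inference in vLLM
      use_deep_gemm: true   # enable deep GEMM support
```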
Training metrics enhancement
nemo_rl/algorithms/grpo.py
Added generation KL error metric printout to training results display alongside existing loss metrics.
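The generation KL error measures the discrepancy between the log-probabilities the training framework computes and those the generation engine reports for the same tokens. The exact formula used in grpo.py is not shown in this summary; as a hedged sketch, a common non-negative estimator (the "k3" approximate-KL estimator) could be computed like this, where the function name and inputs are illustrative:

```python
import numpy as np

def generation_kl_error(train_logprobs, gen_logprobs):
    """Illustrative per-token KL-style error between two sets of logprobs.

    Uses the k3 estimator KL ~= E[exp(d) - d - 1] with d = gen - train,
    which is non-negative and exactly zero when the two agree.
    """
    diff = np.asarray(gen_logprobs) - np.asarray(train_logprobs)
    return float(np.mean(np.exp(diff) - diff - 1.0))
```

A large value of this metric flags that the FP8 generation engine has drifted numerically from the training-precision policy.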
FP8 support for MoE modules
nemo_rl/models/generation/fp8.py
Extended FP8 quantization path to handle FusedMoE weight processing: added FusedMoE module import and type checks, introduced process_weights_after_loading_moe handler for MoE-specific weight conversion, updated _get_module_from_param_name and _is_fp8_weight to recognize FusedMoE instances, relaxed MoE validation assertion, enhanced tensor padding in cast_tensor_to_fp8_blockwise, and patched Fp8MoEMethod.process_weights_after_loading.
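The padding and unpadding behavior in cast_tensor_to_fp8_blockwise can be illustrated with a simplified NumPy sketch. This stands in for the real code (which casts to an actual FP8 dtype such as torch.float8_e4m3fn); the function name, block size, and scale convention here are illustrative, and no FP8 rounding is performed — only the block alignment and per-tile scaling are modeled:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def cast_to_fp8_blockwise(weight, block_size=128):
    """Pad a 2D weight to block alignment, scale per tile, then unpad (sketch)."""
    rows, cols = weight.shape
    pad_r = (-rows) % block_size  # rows needed to reach a block multiple
    pad_c = (-cols) % block_size
    padded = np.pad(weight, ((0, pad_r), (0, pad_c)))
    pr, pc = padded.shape
    tiles = padded.reshape(pr // block_size, block_size,
                           pc // block_size, block_size)
    # One scale per (block_size x block_size) tile, chosen so the tile
    # fits within the FP8 E4M3 range.
    amax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    scales = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    scaled = tiles / scales
    # Dequantize and strip the padding so the result matches the input shape.
    dequant = (scaled * scales).reshape(pr, pc)[:rows, :cols]
    return dequant, scales.squeeze(axis=(1, 3))
```

For MoE, the same idea would apply to each expert's w13_weight and w2_weight, which is why tensor shapes that are not block-aligned need the pad/unpad step.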

Sequence Diagram(s)

sequenceDiagram
    participant Load as Model Loading
    participant Check as FP8 Detection
    participant Route as Module Routing
    participant Process as Weight Processing
    
    Load->>Check: Load model with FP8 enabled
    Check->>Route: Identify FusedMoE modules
    
    alt FusedMoE Module Found
        Route->>Process: Route to process_weights_after_loading_moe
        Process->>Process: Extract w13_weight, w2_weight
        Process->>Process: Pad tensors to block_size alignment
        Process->>Process: Cast to FP8 (blockwise)
        Process->>Process: Unpad results
    else Linear Module
        Route->>Process: Route to standard Linear processing
        Process->>Process: Cast weights to FP8
    end
    
    Process-->>Load: Ready for inference
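The routing branch in the diagram above is essentially a type dispatch. In this sketch, Linear and FusedMoE are empty stand-ins for torch.nn.Linear and vLLM's FusedMoE layer, and the returned strings are placeholders for the actual processing paths:

```python
class Linear:  # stand-in for torch.nn.Linear
    pass

class FusedMoE:  # stand-in for vLLM's FusedMoE layer
    pass

def route_weight_processing(module):
    """Dispatch FP8 post-load weight processing by module type (sketch)."""
    if isinstance(module, FusedMoE):
        # MoE path: process w13_weight / w2_weight with padding + blockwise cast
        return "process_weights_after_loading_moe"
    if isinstance(module, Linear):
        # Standard path: blockwise FP8 cast of the linear weight
        return "process_weights_after_loading"
    return "skip"
```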

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • FusedMoE weight processing logic: New process_weights_after_loading_moe function introduces MoE-specific tensor handling with padding/unpadding, requiring careful validation of tensor shape manipulations.
  • Module traversal changes: Early-return logic in _get_module_from_param_name and extended weight classification in _is_fp8_weight impact type detection across the model tree; edge cases with nested MoE structures need validation.
  • Patching behavior refinement: Changes to apply_fp8_patches affecting both Fp8LinearMethod and Fp8MoEMethod require verifying that the patches are applied in the correct order and that execution flows through them as intended.
  • Assertion removal: Removal of the MoE num_experts zero-check could affect backward compatibility; verify intended scope.
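The patching behavior mentioned above follows a common monkey-patching pattern: replace a method on the quantization class with a wrapper while keeping a handle to the original. Fp8MoEMethodStub below is a stand-in for vLLM's Fp8MoEMethod (not the real class), and the "is_moe" attribute and return strings are invented for illustration:

```python
class Fp8MoEMethodStub:
    """Stand-in for vLLM's Fp8MoEMethod; not the real class."""
    def process_weights_after_loading(self, layer):
        return "original"

# Keep a reference to the original method before patching.
_original = Fp8MoEMethodStub.process_weights_after_loading

def _patched(self, layer):
    # Handle the MoE-specific path first; defer to the original otherwise.
    if getattr(layer, "is_moe", False):
        return "patched-moe"
    return _original(self, layer)

# Apply the patch (analogous in spirit to apply_fp8_patches).
Fp8MoEMethodStub.process_weights_after_loading = _patched
```

Because patches like this are order-sensitive, reviewers need to confirm the patched method is installed before any weights are loaded.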

Suggested labels

CI:L1

Suggested reviewers

  • parthchadha
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name: Test Results For Major Changes
Status: ⚠️ Warning
Explanation: This PR introduces a major new feature adding FP8 precision support for Mixture-of-Experts (MoE) models during the generation phase. The changes include significant modifications to nemo_rl/models/generation/fp8.py (marked as "High" review effort), new configuration files, and weight processing logic. Since FP8 is a different precision format, these changes could affect numerical behavior and convergence. However, according to the PR objectives, the PR description explicitly states that "A placeholder for a usage example and a Python snippet is included but not populated" and, critically, "none of these checklist items [adding tests, running unit/functional tests, and updating documentation] are marked complete." The PR lacks any documented test results, convergence/regression demonstrations, or performance comparisons required for changes of this magnitude.
Resolution: To pass this check, the PR description must be updated to include: (1) test results demonstrating the feature works correctly with FP8 MoE models, (2) evidence that there is no numerical regression or convergence degradation compared to non-FP8 baselines, and (3) performance metrics (before-and-after numbers) showing the impact of FP8 quantization on generation speed and model quality. Additionally, the contributor checklist items should be marked complete, example code should be populated, and documentation should be updated.
✅ Passed checks (2 passed)
Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
Title Check — ✅ Passed: The title 'feat: Fp8 moe rollout' directly relates to the main changes in the changeset, which add FP8 precision support for Mixture-of-Experts (MoE) models during generation, as evidenced by the new configuration file and FP8 handling enhancements in fp8.py.




Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2e645 and 13439ad.

📒 Files selected for processing (3)
  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml (1 hunks)
  • nemo_rl/algorithms/grpo.py (1 hunks)
  • nemo_rl/models/generation/fp8.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/generation/fp8.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/generation/fp8.py
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Files:

  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
🧠 Learnings (3)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/generation/fp8.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

Comment thread nemo_rl/algorithms/grpo.py
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 linked an issue Nov 3, 2025 that may be closed by this pull request
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from a team as a code owner November 9, 2025 21:25
@guyueh1 guyueh1 self-assigned this Nov 9, 2025
@guyueh1 guyueh1 added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Nov 9, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from a team as a code owner November 9, 2025 21:38
@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label Nov 9, 2025
@guyueh1 guyueh1 removed Documentation Improvements or additions to documentation CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Nov 9, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from terrykong November 25, 2025 17:23
@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L0 Run doctests and unit tests labels Nov 25, 2025
@guyueh1
Contributor Author

guyueh1 commented Nov 25, 2025

@terrykong this is ready for review; I am running tests.

I had to skip the added MoE unit test (it is kept, but will be skipped) because the GitHub CI doesn't have 8 GPUs.

@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Nov 25, 2025
Comment thread examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml Outdated
Collaborator

@terrykong terrykong left a comment


other than one comment, lgtm

Signed-off-by: root <root@pool0-01727.cm.cluster>
@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Nov 26, 2025
@terrykong terrykong enabled auto-merge (squash) November 26, 2025 19:05
@terrykong terrykong merged commit b772e48 into NVIDIA-NeMo:main Nov 27, 2025
40 of 41 checks passed
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests Documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Improve DeepSeek Generation Speed with FP8 Quantization
  • FP8 generation in vLLM for MoEs

2 participants