
feat: Fp8 moe rollout#1446

Merged
terrykong merged 24 commits into NVIDIA-NeMo:main from guyueh1:fp8_moe_rollout
Nov 27, 2025

Conversation

@guyueh1
Contributor

@guyueh1 guyueh1 commented Oct 29, 2025

What does this PR do ?

Support fp8 precision in generation phase for MoE models.

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
COMMAND="uv run bash tests/test_suites/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.sh" \
CONTAINER=<YOUR CONTAINER IMAGE> \
sbatch --account <SLURM ACCOUNT> --nodes 1 --partition <SLURM PARTITION> ray.sub

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added new GRPO configuration with FP8 quantization support for optimized performance.
    • Generation KL Error metric now displayed during training results.
    • Extended FP8 quantization support to MoE (Mixture of Experts) models.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners October 29, 2025 18:03
guyueh1 and others added 6 commits October 31, 2025 09:49
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented Nov 2, 2025

📝 Walkthrough


This pull request extends FP8 quantization support to MoE (Mixture of Experts) models within the generation pipeline. Changes include a new GRPO configuration file enabling FP8 inference with Megatron parallelism, enhancements to FP8 weight processing for FusedMoE modules, modified module traversal logic, and an additional training metric display.

Changes

Cohort / File(s) Summary
Configuration for FP8-enabled GRPO
examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
New configuration file setting up GRPO training with FP8 quantization, Megatron parallelism (tensor parallel size 1, pipeline parallel size 2), and vLLM inference with FP8 precision and deep GEMM support.
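Based on the summary above, the key settings might look like the following illustrative YAML fragment. The key names and nesting here are assumptions inferred from the description, not the actual contents of grpo_math_qwen30ba3b_megatron_fp8.yaml:

```yaml
# Illustrative sketch only -- key names and structure are assumed,
# not copied from the real config file.
policy:
  megatron_cfg:
    tensor_model_parallel_size: 1
    pipeline_model_parallel_size: 2
  generation:
    backend: vllm
    vllm_cfg:
      precision: fp8        # FP8 inference in vLLM
      use_deep_gemm: true   # enable deep GEMM support
```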
Training metrics enhancement
nemo_rl/algorithms/grpo.py
Added generation KL error metric printout to training results display alongside existing loss metrics.
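The generation KL error measures the discrepancy between the log-probabilities the training framework computes and those the generation engine reports for the same tokens. The exact formula used in grpo.py is not shown in this summary; as a hedged sketch, a common non-negative estimator (the "k3" approximate-KL estimator) could be computed like this, where the function name and inputs are illustrative:

```python
import numpy as np

def generation_kl_error(train_logprobs, gen_logprobs):
    """Illustrative per-token KL-style error between two sets of logprobs.

    Uses the k3 estimator KL ~= E[exp(d) - d - 1] with d = gen - train,
    which is non-negative and exactly zero when the two agree.
    """
    diff = np.asarray(gen_logprobs) - np.asarray(train_logprobs)
    return float(np.mean(np.exp(diff) - diff - 1.0))
```

A large value of this metric flags that the FP8 generation engine has drifted numerically from the training-precision policy.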
FP8 support for MoE modules
nemo_rl/models/generation/fp8.py
Extended FP8 quantization path to handle FusedMoE weight processing: added FusedMoE module import and type checks, introduced process_weights_after_loading_moe handler for MoE-specific weight conversion, updated _get_module_from_param_name and _is_fp8_weight to recognize FusedMoE instances, relaxed MoE validation assertion, enhanced tensor padding in cast_tensor_to_fp8_blockwise, and patched Fp8MoEMethod.process_weights_after_loading.
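The padding and unpadding behavior in cast_tensor_to_fp8_blockwise can be illustrated with a simplified NumPy sketch. This stands in for the real code (which casts to an actual FP8 dtype such as torch.float8_e4m3fn); the function name, block size, and scale convention here are illustrative, and no FP8 rounding is performed — only the block alignment and per-tile scaling are modeled:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def cast_to_fp8_blockwise(weight, block_size=128):
    """Pad a 2D weight to block alignment, scale per tile, then unpad (sketch)."""
    rows, cols = weight.shape
    pad_r = (-rows) % block_size  # rows needed to reach a block multiple
    pad_c = (-cols) % block_size
    padded = np.pad(weight, ((0, pad_r), (0, pad_c)))
    pr, pc = padded.shape
    tiles = padded.reshape(pr // block_size, block_size,
                           pc // block_size, block_size)
    # One scale per (block_size x block_size) tile, chosen so the tile
    # fits within the FP8 E4M3 range.
    amax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    scales = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    scaled = tiles / scales
    # Dequantize and strip the padding so the result matches the input shape.
    dequant = (scaled * scales).reshape(pr, pc)[:rows, :cols]
    return dequant, scales.squeeze(axis=(1, 3))
```

For MoE, the same idea would apply to each expert's w13_weight and w2_weight, which is why tensor shapes that are not block-aligned need the pad/unpad step.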

Sequence Diagram(s)

sequenceDiagram
    participant Load as Model Loading
    participant Check as FP8 Detection
    participant Route as Module Routing
    participant Process as Weight Processing
    
    Load->>Check: Load model with FP8 enabled
    Check->>Route: Identify FusedMoE modules
    
    alt FusedMoE Module Found
        Route->>Process: Route to process_weights_after_loading_moe
        Process->>Process: Extract w13_weight, w2_weight
        Process->>Process: Pad tensors to block_size alignment
        Process->>Process: Cast to FP8 (blockwise)
        Process->>Process: Unpad results
    else Linear Module
        Route->>Process: Route to standard Linear processing
        Process->>Process: Cast weights to FP8
    end
    
    Process-->>Load: Ready for inference
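The routing branch in the diagram above is essentially a type dispatch. In this sketch, Linear and FusedMoE are empty stand-ins for torch.nn.Linear and vLLM's FusedMoE layer, and the returned strings are placeholders for the actual processing paths:

```python
class Linear:  # stand-in for torch.nn.Linear
    pass

class FusedMoE:  # stand-in for vLLM's FusedMoE layer
    pass

def route_weight_processing(module):
    """Dispatch FP8 post-load weight processing by module type (sketch)."""
    if isinstance(module, FusedMoE):
        # MoE path: process w13_weight / w2_weight with padding + blockwise cast
        return "process_weights_after_loading_moe"
    if isinstance(module, Linear):
        # Standard path: blockwise FP8 cast of the linear weight
        return "process_weights_after_loading"
    return "skip"
```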

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • FusedMoE weight processing logic: New process_weights_after_loading_moe function introduces MoE-specific tensor handling with padding/unpadding, requiring careful validation of tensor shape manipulations.
  • Module traversal changes: Early-return logic in _get_module_from_param_name and extended weight classification in _is_fp8_weight impact type detection across the model tree; edge cases with nested MoE structures need validation.
  • Patching behavior refinement: Changes to apply_fp8_patches affecting both Fp8LinearMethod and Fp8MoEMethod require verifying that the patches are applied in the correct order and that execution flows through them as intended.
  • Assertion removal: Removal of the MoE num_experts zero-check could affect backward compatibility; verify intended scope.
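The patching behavior mentioned above follows a common monkey-patching pattern: replace a method on the quantization class with a wrapper while keeping a handle to the original. Fp8MoEMethodStub below is a stand-in for vLLM's Fp8MoEMethod (not the real class), and the "is_moe" attribute and return strings are invented for illustration:

```python
class Fp8MoEMethodStub:
    """Stand-in for vLLM's Fp8MoEMethod; not the real class."""
    def process_weights_after_loading(self, layer):
        return "original"

# Keep a reference to the original method before patching.
_original = Fp8MoEMethodStub.process_weights_after_loading

def _patched(self, layer):
    # Handle the MoE-specific path first; defer to the original otherwise.
    if getattr(layer, "is_moe", False):
        return "patched-moe"
    return _original(self, layer)

# Apply the patch (analogous in spirit to apply_fp8_patches).
Fp8MoEMethodStub.process_weights_after_loading = _patched
```

Because patches like this are order-sensitive, reviewers need to confirm the patched method is installed before any weights are loaded.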

Suggested labels

CI:L1

Suggested reviewers

  • parthchadha
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name: Test Results For Major Changes
Status: ⚠️ Warning
Explanation: This PR introduces a major new feature adding FP8 precision support for Mixture-of-Experts (MoE) models during the generation phase. The changes include significant modifications to nemo_rl/models/generation/fp8.py (marked as "High" review effort), new configuration files, and weight processing logic. Since FP8 is a different precision format, these changes could affect numerical behavior and convergence. However, according to the PR objectives, the PR description explicitly states that "A placeholder for a usage example and a Python snippet is included but not populated" and, critically, "none of these checklist items [adding tests, running unit/functional tests, and updating documentation] are marked complete." The PR lacks any documented test results, convergence/regression demonstrations, or performance comparisons required for changes of this magnitude.
Resolution: To pass this check, the PR description must be updated to include: (1) test results demonstrating the feature works correctly with FP8 MoE models, (2) evidence that there is no numerical regression or convergence degradation compared to non-FP8 baselines, and (3) performance metrics (before-and-after numbers) showing the impact of FP8 quantization on generation speed and model quality. Additionally, the contributor checklist items should be marked complete, example code should be populated, and documentation should be updated.
✅ Passed checks (2 passed)
Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
Title Check — ✅ Passed: The title 'feat: Fp8 moe rollout' directly relates to the main changes in the changeset, which add FP8 precision support for Mixture-of-Experts (MoE) models during generation, as evidenced by the new configuration file and FP8 handling enhancements in fp8.py.




Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2e645 and 13439ad.

📒 Files selected for processing (3)
  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml (1 hunks)
  • nemo_rl/algorithms/grpo.py (1 hunks)
  • nemo_rl/models/generation/fp8.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/generation/fp8.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/generation/fp8.py
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Files:

  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
🧠 Learnings (3)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/generation/fp8.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

Comment thread nemo_rl/algorithms/grpo.py
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 linked an issue Nov 3, 2025 that may be closed by this pull request
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from a team as a code owner November 9, 2025 21:25
@guyueh1 guyueh1 self-assigned this Nov 9, 2025
@guyueh1 guyueh1 added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Nov 9, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from a team as a code owner November 9, 2025 21:38
@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label Nov 9, 2025
@guyueh1 guyueh1 removed Documentation Improvements or additions to documentation CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Nov 9, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested a review from terrykong November 25, 2025 17:23
@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L0 Run doctests and unit tests labels Nov 25, 2025
@guyueh1
Contributor Author

guyueh1 commented Nov 25, 2025

@terrykong this is ready for review; I am running tests.

I had to skip the added MoE unit test (it is kept, but will be skipped) because the GitHub CI doesn't have 8 GPUs.

@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Nov 25, 2025
Comment thread examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml Outdated
Collaborator

@terrykong terrykong left a comment


other than one comment, lgtm

Signed-off-by: root <root@pool0-01727.cm.cluster>
@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Nov 26, 2025
@terrykong terrykong enabled auto-merge (squash) November 26, 2025 19:05
@terrykong terrykong merged commit b772e48 into NVIDIA-NeMo:main Nov 27, 2025
40 of 41 checks passed
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: root <root@pool0-01727.cm.cluster>
Co-authored-by: root <root@pool0-01727.cm.cluster>

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests Documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Improve DeepSeek Generation Speed with FP8 Quantization
  • FP8 generation in vLLM for MoEs

2 participants