fix: maintain fp32 mlp.router.expert_bias even with bf16 enabled by ZhiyuLi-Nvidia · Pull Request #674 · NVIDIA-NeMo/RL

ZhiyuLi-Nvidia · 2025-07-16T02:17:09Z

What does this PR do ?

fix: maintain fp32 mlp.router.expert_bias even with bf16 enabled

The dtype of model.layers.x.mlp.gate.e_score_correction_bias changes between step 1 and the following steps, which makes it impossible to cache parameter dtypes and triggers this inconsistency error:

f"ERROR: {key}: {shape} != {shape_ref}", where shape is the value from the 1st step and shape_ref is the value from the 2nd step

('ERROR: model.layers.3.mlp.gate.e_score_correction_bias: torch.bfloat16 != torch.float32',)
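
To make the failure mode concrete, here is a minimal sketch of a consistency check that would produce a message of this form. It is not the check used in NeMo RL; the function name find_dtype_mismatches, the second parameter key, and the use of plain strings in place of torch dtypes are all assumptions made for illustration.

```python
def find_dtype_mismatches(current, reference):
    """Compare two {param_name: dtype} mappings and report disagreements.

    `current` plays the role of the record taken at step 1 and
    `reference` the record taken at step 2 (hypothetical naming).
    """
    errors = []
    for key, dtype_ref in reference.items():
        dtype = current.get(key)
        if dtype != dtype_ref:
            errors.append(f"ERROR: {key}: {dtype} != {dtype_ref}")
    return errors


# Dtypes are written as strings so the sketch runs without torch installed.
step1 = {
    "model.layers.3.mlp.gate.e_score_correction_bias": "torch.bfloat16",
    "model.layers.3.mlp.gate.weight": "torch.bfloat16",  # hypothetical key
}
step2 = {
    "model.layers.3.mlp.gate.e_score_correction_bias": "torch.float32",
    "model.layers.3.mlp.gate.weight": "torch.bfloat16",  # hypothetical key
}

errors = find_dtype_mismatches(step1, step2)
for line in errors:
    print(line)
```

A dtype record cached at step 1 would therefore be stale from step 2 onward unless the bias keeps a single dtype throughout.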

Details

hf's model.layers.x.mlp.gate.e_score_correction_bias originates from decoder.layers.x.mlp.router.expert_bias in mcore
decoder.layers.x.mlp.router.expert_bias should start as fp32 and always stay fp32
something seems to be wrong in setup_megatron_model
- https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/models/policy/megatron_policy_worker.py#L192-L199
- There is a hidden cast from fp32 to bf16 (param_dtype) despite the fp32 param initialization: https://github.com/NVIDIA/NeMo/blob/33259f2540af6eef375d43fc48bdcbd7ec490c29/nemo/tron/model.py#L119-L121
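
The hidden cast can be reproduced with a toy module. The sketch below is not the patch from this PR: ToyRouter and restore_fp32_expert_bias are hypothetical names, and it assumes expert_bias is a registered floating-point buffer, so a module-wide cast to bf16 converts it along with the weights.

```python
import torch
import torch.nn as nn


class ToyRouter(nn.Module):
    """Hypothetical stand-in for an MoE router with an fp32 expert_bias."""

    def __init__(self, hidden_size=8, num_experts=4):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(num_experts, hidden_size))
        # Created in fp32, like the bias described above.
        self.register_buffer(
            "expert_bias", torch.zeros(num_experts, dtype=torch.float32)
        )


def restore_fp32_expert_bias(model):
    """Cast every expert_bias buffer back to fp32; return the names touched."""
    restored = []
    for name, module in model.named_modules():
        bias = getattr(module, "expert_bias", None)
        if isinstance(bias, torch.Tensor) and bias.dtype != torch.float32:
            module.expert_bias = bias.to(torch.float32)
            restored.append(f"{name}.expert_bias" if name else "expert_bias")
    return restored


model = nn.Sequential(ToyRouter(), ToyRouter())

# A module-wide cast converts floating-point buffers as well as parameters.
model.to(torch.bfloat16)
dtype_after_cast = model[0].expert_bias.dtype

restored = restore_fp32_expert_bias(model)
```

After the restore step the weights stay in bf16 while the bias is fp32 again, which is the invariant this PR aims to maintain.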

avoid serializing rebuild_cuda_tensor function
Serializing the function is costly in CUDA IPC transfers between workers. This change gave some performance improvement.
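
The idea can be sketched with the standard library alone: instead of shipping a (function, args) pair, ship only the args and let the receiver call a rebuild function it already has. Everything here is illustrative; rebuild_tensor_stub is a hypothetical stand-in for rebuild_cuda_tensor, plain pickle stands in for whatever serializer the workers use, and the payload layout is an assumption.

```python
import pickle


def rebuild_tensor_stub(shape, dtype_name, device_index):
    """Hypothetical stand-in for a rebuild function such as rebuild_cuda_tensor."""
    return {"shape": shape, "dtype": dtype_name, "device": device_index}


def pack_with_function(args):
    # Variant to avoid: the callable travels with every payload.
    return pickle.dumps((rebuild_tensor_stub, args))


def unpack_with_function(payload):
    fn, args = pickle.loads(payload)
    return fn(*args)


def pack_args_only(args):
    # Variant sketched here: only the arguments are serialized.
    return pickle.dumps(args)


def unpack_args_only(payload):
    # The receiver already knows which function rebuilds the tensor.
    args = pickle.loads(payload)
    return rebuild_tensor_stub(*args)


args = ((2, 3), "torch.float32", 0)
rebuilt = unpack_args_only(pack_args_only(args))
```

Both variants rebuild the same object; the args-only variant simply leaves the callable out of each message.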

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...


yuki-97 · 2025-07-16T03:53:55Z

Targeting yukih/prepare-refit-info first so the modifications are easier to compare.
Will retarget to main after #638 is merged.


The base branch was changed.

ZhiyuLi-Nvidia · 2025-07-23T21:01:21Z

Closing this PR since it is part of #686.

ZhiyuLi-Nvidia force-pushed the zhiyul/yukih/prepare-refit-info branch from 061c3fc to 72ceb4f Compare July 16, 2025 02:43

ZhiyuLi-Nvidia changed the title ~~Zhiyul/yukih/prepare refit info~~ fix: maintain fp32 mlp.router.expert_bias even with bf16 enabled Jul 16, 2025

ZhiyuLi-Nvidia requested review from yfw and yuki-97 July 16, 2025 02:47

yuki-97 added 11 commits July 16, 2025 03:38

move some code to refit util

71380f5

Signed-off-by: Yuki Huang <yukih@nvidia.com>

prepare refit info once

8fef285

Signed-off-by: Yuki Huang <yukih@nvidia.com>

rename update_weights

42613ca

Signed-off-by: Yuki Huang <yukih@nvidia.com>

fix rebase

26af037

Signed-off-by: Yuki Huang <yukih@nvidia.com>

add metainfo in prepare_refit_info

d8e48f3

Signed-off-by: Yuki Huang <yukih@nvidia.com>

some fix and update unit test with prepare_refit_info

eb6b910

Signed-off-by: Yuki Huang <yukih@nvidia.com>

support update dtype record during training

23c5d3a

Signed-off-by: Yuki Huang <yukih@nvidia.com>

not cache refit_param_info since dtype may change

687c868

Signed-off-by: Yuki Huang <yukih@nvidia.com>

add unit test

cabac3b

Signed-off-by: Yuki Huang <yukih@nvidia.com>

rename some vars

a88ca8d

Signed-off-by: Yuki Huang <yukih@nvidia.com>

add NRL_REFIT_BUFFER_MEMORY_RATIO, update default from 10% to 20% in mcore for speedup

f152f61

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 force-pushed the yukih/prepare-refit-info branch from c685241 to f152f61 Compare July 16, 2025 03:40

yuki-97 changed the base branch from yukih/prepare-refit-info to main July 16, 2025 03:44

ZhiyuLi-Nvidia added 2 commits July 16, 2025 03:47

fix: maintain fp32 mlp.router.expert_bias even with bf16 enabled

1da2f5e

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

avoid serializing rebuild_cuda_tensor function

a6018ab

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

yuki-97 force-pushed the zhiyul/yukih/prepare-refit-info branch from 72ceb4f to a6018ab Compare July 16, 2025 03:51

yuki-97 changed the base branch from main to yukih/prepare-refit-info July 16, 2025 03:52

yfw previously approved these changes Jul 16, 2025


assert dtype

ad1d3ba

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 force-pushed the zhiyul/yukih/prepare-refit-info branch from fd433d3 to ad1d3ba Compare July 16, 2025 12:50

yuki-97 mentioned this pull request Jul 16, 2025

feat: optimize refit by preparing refit info ahead of time #638

Merged

Base automatically changed from yukih/prepare-refit-info to main July 16, 2025 21:50

ZhiyuLi-Nvidia closed this Jul 23, 2025

fix: maintain fp32 mlp.router.expert_bias even with bf16 enabled #674
ZhiyuLi-Nvidia wants to merge 14 commits into main from zhiyul/yukih/prepare-refit-info

3 participants