fix: enable TE FusedAdam for Qwen3.5 MoE & GLM-4.7-Flash automodel recipes #2320
Open
Conversation
Resolve string-valued torch dtypes (e.g. "torch.bfloat16") in the automodel optimizer kwargs so TE FusedAdam's exp_avg_dtype and exp_avg_sq_dtype can be specified from YAML.

Migrate the three Qwen3.5-35B-A3B automodel GRPO recipes (llm 2n8g EP16, llm DAPO 4n8g, vlm geo3k 2n8g EP16) from torch.optim.AdamW to transformer_engine.pytorch.optimizers.fused_adam.FusedAdam, carrying over lr/weight_decay/betas/eps from the prior settings. Use `_override_: true` on the optimizer block so the base grpo_math_1B.yaml optimizer config (including foreach/fused=False, which FusedAdam does not accept) is replaced rather than merged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
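A minimal sketch of the string-to-dtype resolution this commit describes; `resolve_torch_dtype` and the kwarg values are illustrative, not the PR's actual helper or recipe settings:

```python
import torch

def resolve_torch_dtype(value):
    # Turn "torch.bfloat16" (or plain "bfloat16") into torch.bfloat16; leave
    # every other value -- including real torch.dtype objects -- untouched.
    if isinstance(value, str):
        name = value.removeprefix("torch.")
        resolved = getattr(torch, name, None)
        if isinstance(resolved, torch.dtype):
            return resolved
    return value

# Kwargs as they might arrive from a YAML recipe (values illustrative):
optimizer_kwargs = {
    "lr": 1.0e-6,
    "exp_avg_dtype": "torch.bfloat16",
    "exp_avg_sq_dtype": "torch.float32",
}
optimizer_kwargs = {k: resolve_torch_dtype(v) for k, v in optimizer_kwargs.items()}
print(optimizer_kwargs["exp_avg_dtype"])  # torch.bfloat16
```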
Contributor (Author)
/ok to test 4c71574
The `foreach: False` / `fused: False` kwargs were added ~1 year ago in the initial FSDP2/DTensor support PR (#131, commit 085fa66) as a defensive measure for DTensor compatibility. PyTorch DTensor has since added native `_foreach_*` kernel coverage, and the auto-selected defaults (`foreach=None`, `fused=None`) are correct for DTensor tensors on the currently pinned `torch==2.10.0`.

Dropping these from the base unblocks TE FusedAdam on recipes that inherit from grpo_math_1B.yaml without the previous `_override_: true` trick, because FusedAdam does not accept those AdamW-only kwargs. Re-minimize the three Qwen3.5 MoE automodel recipes accordingly: the `_override_: true` markers are removed and kwargs that now match the (cleaner) base are elided.

Scoped to grpo_math_1B.yaml only; the same cleanup for the sft/dpo/rm base configs is deferred to a follow-up once this change is validated in nightly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
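A quick way to see both halves of the argument above, as a sketch (assumes transformer_engine is installed; the exact error wording may differ):

```python
import inspect

import torch
from transformer_engine.pytorch.optimizers.fused_adam import FusedAdam

# With foreach/fused left at their None defaults, torch.optim.AdamW
# auto-selects the kernel implementation, so the base YAML no longer
# needs to pin them to False for DTensor parameters.
sig = inspect.signature(torch.optim.AdamW.__init__)
print(sig.parameters["foreach"].default, sig.parameters["fused"].default)  # None None

# FusedAdam accepts neither kwarg, which is why merging the old base
# optimizer block into a FusedAdam recipe failed without `_override_: true`.
try:
    FusedAdam(torch.nn.Linear(2, 2).parameters(), lr=1e-3, foreach=False)
except TypeError as err:
    print(err)  # unexpected keyword argument 'foreach'
```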
Contributor (Author)
/ok to test 8b4f3b2
Apply the same FusedAdam migration used for the Qwen3.5 MoE recipes: switch torch.optim.AdamW to transformer_engine.pytorch.optimizers.fused_adam.FusedAdam with `master_weights=True` so the optimizer keeps an internal FP32 master copy, bypassing the missing FP32 master weights on the Automodel custom MoE path. lr/weight_decay are unchanged; betas/eps are inherited from the base grpo_math_1B.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
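Roughly what the migrated optimizer construction amounts to in Python terms (a sketch: the model stand-in and all hyperparameter values are placeholders, not the recipe's real settings):

```python
import torch
from transformer_engine.pytorch.optimizers.fused_adam import FusedAdam

model = torch.nn.Linear(8, 8, dtype=torch.bfloat16)  # stand-in for the policy model

# master_weights=True makes FusedAdam keep an internal FP32 master copy of
# each parameter, sidestepping the missing FP32 master weights on the
# Automodel custom MoE path while the model itself stays in BF16.
optimizer = FusedAdam(
    model.parameters(),
    lr=1e-6,             # carried over unchanged (placeholder value)
    weight_decay=0.0,    # carried over unchanged (placeholder value)
    betas=(0.9, 0.999),  # inherited from grpo_math_1B.yaml (placeholder value)
    eps=1e-8,            # inherited from grpo_math_1B.yaml (placeholder value)
    master_weights=True,
)
```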
Contributor (Author)
/ok to test 58cc3d7
TE FusedAdam's step() allocates per-parameter state (exp_avg/exp_avg_sq/master_param) before the p.grad-is-None check, so frozen parameters (e.g. the visual encoder in text-only training) still get optimizer state entries. DCP then saves that state, and the next resume fails inside gather_object with a misleading "cannot pickle code objects" error (DCP's _wrap_exception captures the real "Size mismatch" ValueError, whose traceback contains a CodeType).

Pass only requires_grad=True parameters to the optimizer so the frozen visual subtree never enters optimizer state in the first place. This also matches the standard PyTorch idiom and works regardless of which optimizer backend the recipe selects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
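The fix itself is the standard PyTorch idiom, roughly as below (a sketch; the submodule names are illustrative):

```python
import torch
from transformer_engine.pytorch.optimizers.fused_adam import FusedAdam

model = torch.nn.ModuleDict({
    "visual": torch.nn.Linear(8, 8),    # frozen during text-only training
    "language": torch.nn.Linear(8, 8),  # trained
})
model["visual"].requires_grad_(False)

# Hand the optimizer only trainable parameters, so the frozen visual subtree
# never acquires exp_avg/exp_avg_sq/master_param state that DCP would
# checkpoint and a later resume would fail to restore.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = FusedAdam(trainable_params, lr=1e-6, master_weights=True)
```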
What does this PR do?
Enable TE FusedAdam in the Qwen3.5-35B-A3B automodel GRPO recipes and allow string-valued torch.* dtypes in automodel optimizer kwargs so FusedAdam's exp_avg_dtype/exp_avg_sq_dtype can be set from YAML.

Issues
List issues that this PR closes (syntax):
#2322
Before your PR is "Ready for review"
Pre checks:
Additional Information
The `foreach: False` / `fused: False` kwargs were a defensive default from the initial DTensor support PR (feat: Add FSDP2, DTensor SP/TP, activation checkpointing support #131, ~1 year ago). Nothing in the codebase depends on them (`grep -rE "foreach=False|fused=False" nemo_rl/` is empty).

🤖 Generated with Claude Code