fix: enable TE FusedAdam for Qwen3.5 MoE & GLM-4.7-Flash automodel recipes #2320
Open
Conversation
Resolve string-valued torch dtypes (e.g. "torch.bfloat16") in the automodel optimizer kwargs so TE FusedAdam's exp_avg_dtype and exp_avg_sq_dtype can be specified from YAML.

Migrate the three Qwen3.5-35B-A3B automodel GRPO recipes (llm 2n8g EP16, llm DAPO 4n8g, vlm geo3k 2n8g EP16) from torch.optim.AdamW to transformer_engine.pytorch.optimizers.fused_adam.FusedAdam, carrying over lr/weight_decay/betas/eps from the prior settings. Use `_override_: true` on the optimizer block so the base grpo_math_1B.yaml optimizer config (including foreach/fused=False, which FusedAdam does not accept) is replaced rather than merged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
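A minimal sketch of the string-to-dtype resolution this commit describes; `resolve_torch_dtype` and the kwarg values are illustrative, not the PR's actual helper or recipe settings:

```python
import torch

def resolve_torch_dtype(value):
    # Turn "torch.bfloat16" (or plain "bfloat16") into torch.bfloat16; leave
    # every other value -- including real torch.dtype objects -- untouched.
    if isinstance(value, str):
        name = value.removeprefix("torch.")
        resolved = getattr(torch, name, None)
        if isinstance(resolved, torch.dtype):
            return resolved
    return value

# Kwargs as they might arrive from a YAML recipe (values illustrative):
optimizer_kwargs = {
    "lr": 1.0e-6,
    "exp_avg_dtype": "torch.bfloat16",
    "exp_avg_sq_dtype": "torch.float32",
}
optimizer_kwargs = {k: resolve_torch_dtype(v) for k, v in optimizer_kwargs.items()}
print(optimizer_kwargs["exp_avg_dtype"])  # torch.bfloat16
```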
Contributor (Author)
/ok to test 4c71574
The `foreach: False` / `fused: False` kwargs were added ~1 year ago in the initial FSDP2/DTensor support PR (#131, commit 085fa66) as a defensive measure for DTensor compatibility. PyTorch DTensor has since added native `_foreach_*` kernel coverage, and the auto-selected defaults (`foreach=None`, `fused=None`) are correct for DTensor tensors on the currently pinned `torch==2.10.0`.

Dropping these from the base unblocks TE FusedAdam on recipes that inherit from grpo_math_1B.yaml without the previous `_override_: true` trick, because FusedAdam does not accept those AdamW-only kwargs. Re-minimize the three Qwen3.5 MoE automodel recipes accordingly: the `_override_: true` markers are removed and kwargs that now match the (cleaner) base are elided.

Scoped to grpo_math_1B.yaml only; the same cleanup for the sft/dpo/rm base configs is deferred to a follow-up once this change is validated in nightly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
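A quick way to see both halves of the argument above, as a sketch (assumes transformer_engine is installed; the exact error wording may differ):

```python
import inspect

import torch
from transformer_engine.pytorch.optimizers.fused_adam import FusedAdam

# With foreach/fused left at their None defaults, torch.optim.AdamW
# auto-selects the kernel implementation, so the base YAML no longer
# needs to pin them to False for DTensor parameters.
sig = inspect.signature(torch.optim.AdamW.__init__)
print(sig.parameters["foreach"].default, sig.parameters["fused"].default)  # None None

# FusedAdam accepts neither kwarg, which is why merging the old base
# optimizer block into a FusedAdam recipe failed without `_override_: true`.
try:
    FusedAdam(torch.nn.Linear(2, 2).parameters(), lr=1e-3, foreach=False)
except TypeError as err:
    print(err)  # unexpected keyword argument 'foreach'
```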
Contributor (Author)
/ok to test 8b4f3b2
Apply the same FusedAdam migration used for the Qwen3.5 MoE recipes: switch torch.optim.AdamW to transformer_engine.pytorch.optimizers.fused_adam.FusedAdam with `master_weights=True` so the optimizer keeps an internal FP32 master copy, bypassing the missing FP32 master weights on the Automodel custom MoE path. lr/weight_decay are unchanged; betas/eps are inherited from the base grpo_math_1B.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
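Roughly what the migrated optimizer construction amounts to in Python terms (a sketch: the model stand-in and all hyperparameter values are placeholders, not the recipe's real settings):

```python
import torch
from transformer_engine.pytorch.optimizers.fused_adam import FusedAdam

model = torch.nn.Linear(8, 8, dtype=torch.bfloat16)  # stand-in for the policy model

# master_weights=True makes FusedAdam keep an internal FP32 master copy of
# each parameter, sidestepping the missing FP32 master weights on the
# Automodel custom MoE path while the model itself stays in BF16.
optimizer = FusedAdam(
    model.parameters(),
    lr=1e-6,             # carried over unchanged (placeholder value)
    weight_decay=0.0,    # carried over unchanged (placeholder value)
    betas=(0.9, 0.999),  # inherited from grpo_math_1B.yaml (placeholder value)
    eps=1e-8,            # inherited from grpo_math_1B.yaml (placeholder value)
    master_weights=True,
)
```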
Contributor (Author)
/ok to test 58cc3d7
TE FusedAdam's step() allocates per-parameter state (exp_avg/exp_avg_sq/master_param) before the p.grad-is-None check, so frozen parameters (e.g. the visual encoder in text-only training) still get optimizer state entries. DCP then saves that state, and the next resume fails inside gather_object with a misleading "cannot pickle code objects" error (DCP's _wrap_exception captures the real "Size mismatch" ValueError, whose traceback contains a CodeType).

Pass only requires_grad=True parameters to the optimizer so the frozen visual subtree never enters optimizer state in the first place. This also matches the standard PyTorch idiom and works regardless of which optimizer backend the recipe selects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
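The fix itself is the standard PyTorch idiom, roughly as below (a sketch; the submodule names are illustrative):

```python
import torch
from transformer_engine.pytorch.optimizers.fused_adam import FusedAdam

model = torch.nn.ModuleDict({
    "visual": torch.nn.Linear(8, 8),    # frozen during text-only training
    "language": torch.nn.Linear(8, 8),  # trained
})
model["visual"].requires_grad_(False)

# Hand the optimizer only trainable parameters, so the frozen visual subtree
# never acquires exp_avg/exp_avg_sq/master_param state that DCP would
# checkpoint and a later resume would fail to restore.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = FusedAdam(trainable_params, lr=1e-6, master_weights=True)
```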
What does this PR do?
Enable TE FusedAdam in the Qwen3.5-35B-A3B automodel GRPO recipes and allow string-valued torch.* dtypes in automodel optimizer kwargs so FusedAdam's exp_avg_dtype/exp_avg_sq_dtype can be set from YAML.

Issues
List issues that this PR closes (syntax):
#2322
Before your PR is "Ready for review"
Pre checks:
Additional Information
The `foreach: False` / `fused: False` kwargs were a defensive default from the initial DTensor support PR (feat: Add FSDP2, DTensor SP/TP, activation checkpointing support #131, ~1 year ago). Nothing in the codebase depends on them (`grep -rE "foreach=False|fused=False" nemo_rl/` is empty).

🤖 Generated with Claude Code