
fix: enable TE FusedAdam for Qwen3.5 MoE & GLM-4.7-Flash automodel recipes #2320

Open

zpqiu wants to merge 5 commits into main from feat/qwen35-moe-fused-adam

Conversation

zpqiu (Contributor) commented Apr 23, 2026

What does this PR do ?

Enable TE FusedAdam in the Qwen3.5-35B-A3B automodel GRPO recipes and allow string-valued torch.* dtypes in automodel optimizer kwargs so FusedAdam's exp_avg_dtype / exp_avg_sq_dtype can be set from YAML.
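
For context, a minimal sketch of how string-valued dtypes could be resolved before the optimizer kwargs are passed through; `_resolve_torch_dtype_strings` is a hypothetical name for illustration, not the repo's actual helper:

```python
import torch

def _resolve_torch_dtype_strings(kwargs: dict) -> dict:
    """Replace values like "torch.bfloat16" with the matching torch.dtype."""
    resolved = {}
    for key, value in kwargs.items():
        if isinstance(value, str) and value.startswith("torch."):
            attr = getattr(torch, value.split(".", 1)[1], None)
            # Only substitute when the attribute really is a dtype.
            resolved[key] = attr if isinstance(attr, torch.dtype) else value
        else:
            resolved[key] = value
    return resolved

# {"exp_avg_dtype": "torch.bfloat16"} -> {"exp_avg_dtype": torch.bfloat16}
```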

Issues

Closes #2322

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

🤖 Generated with Claude Code

Resolve string-valued torch dtypes (e.g. "torch.bfloat16") in the
automodel optimizer kwargs so TE FusedAdam's exp_avg_dtype and
exp_avg_sq_dtype can be specified from YAML.

Migrate the three Qwen3.5-35B-A3B automodel GRPO recipes (llm 2n8g
EP16, llm DAPO 4n8g, vlm geo3k 2n8g EP16) from torch.optim.AdamW to
transformer_engine.pytorch.optimizers.fused_adam.FusedAdam, carrying
over lr/weight_decay/betas/eps from the prior settings. Use
_override_: true on the optimizer block so the base grpo_math_1B.yaml
optimizer config (including foreach/fused=False, which FusedAdam does
not accept) is replaced rather than merged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
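
For reference, a sketch of the recipe-level optimizer block this commit describes, assuming the base config's usual `name`/`kwargs` layout; the numeric values are placeholders, not the recipes' actual settings:

```yaml
policy:
  optimizer:
    _override_: true  # replace the base grpo_math_1B.yaml block instead of merging
    name: "transformer_engine.pytorch.optimizers.fused_adam.FusedAdam"
    kwargs:
      lr: 1.0e-6            # placeholder; carried over from the prior AdamW settings
      weight_decay: 0.01    # placeholder
      betas: [0.9, 0.999]
      eps: 1.0e-8
      exp_avg_dtype: "torch.bfloat16"     # string form, resolved to torch.bfloat16
      exp_avg_sq_dtype: "torch.bfloat16"
```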
copy-pr-bot (Bot) commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@zpqiu zpqiu added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Apr 23, 2026
zpqiu (Contributor, Author) commented Apr 23, 2026

/ok to test 4c71574

@zpqiu zpqiu marked this pull request as ready for review April 23, 2026 09:56
@zpqiu zpqiu requested review from a team as code owners April 23, 2026 09:56
@zpqiu zpqiu marked this pull request as draft April 23, 2026 10:12
The `foreach: False` / `fused: False` kwargs were added ~1 year ago in
the initial FSDP2/DTensor support PR (#131, commit 085fa66) as a
defensive measure for DTensor compatibility. PyTorch DTensor has since
added native `_foreach_*` kernel coverage and the auto-selected defaults
(`foreach=None`, `fused=None`) are correct for DTensor tensors on the
currently pinned `torch==2.10.0`.

Dropping these from the base unblocks using TE FusedAdam on recipes that
inherit from grpo_math_1B.yaml without needing the previous
`_override_: true` trick, because FusedAdam does not accept those
AdamW-only kwargs. Re-minimizes the three Qwen3.5 MoE automodel recipes
accordingly: the `_override_: true` markers are removed and kwargs that
now match the (cleaner) base are elided.

Scoped to grpo_math_1B.yaml only; the same cleanup for sft/dpo/rm base
configs is deferred to a follow-up once this change is validated in
nightly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
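
An illustrative before/after of the configuration change described above; key layout and values are assumptions, shown only to make the merge behaviour concrete:

```yaml
# grpo_math_1B.yaml (base): foreach/fused dropped, left to torch's auto-selection
policy:
  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 5.0e-6           # placeholder
      weight_decay: 0.01   # placeholder
      betas: [0.9, 0.999]
      eps: 1.0e-8

# Qwen3.5 MoE recipe: no _override_ needed; only keys that differ from the base remain
policy:
  optimizer:
    name: "transformer_engine.pytorch.optimizers.fused_adam.FusedAdam"
    kwargs:
      exp_avg_dtype: "torch.bfloat16"
      exp_avg_sq_dtype: "torch.bfloat16"
```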
zpqiu (Contributor, Author) commented Apr 23, 2026

/ok to test 8b4f3b2

@zpqiu zpqiu marked this pull request as ready for review April 23, 2026 14:33
Apply the same FusedAdam migration used for the Qwen3.5 MoE recipes:
switch torch.optim.AdamW to
transformer_engine.pytorch.optimizers.fused_adam.FusedAdam with
master_weights=True so the optimizer keeps an internal FP32 master copy,
working around the lack of FP32 master weights on the Automodel custom MoE
path. lr / weight_decay are unchanged; betas / eps are inherited from
the base grpo_math_1B.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
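
A sketch of what the migrated recipe's optimizer block could look like under this change (layout and values are illustrative, not copied from the recipe):

```yaml
policy:
  optimizer:
    name: "transformer_engine.pytorch.optimizers.fused_adam.FusedAdam"
    kwargs:
      lr: 2.0e-6            # placeholder; unchanged from the prior AdamW setting
      weight_decay: 0.01    # placeholder; unchanged
      master_weights: true  # FusedAdam keeps an internal FP32 master copy
      # betas / eps are inherited from grpo_math_1B.yaml
```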
@zpqiu zpqiu changed the title from "fix: enable TE FusedAdam for Qwen3.5 MoE automodel recipes" to "fix: enable TE FusedAdam for Qwen3.5 MoE & GLM-4.7-Flash automodel recipes" Apr 23, 2026
zpqiu (Contributor, Author) commented Apr 23, 2026

/ok to test 58cc3d7

TE FusedAdam's step() allocates per-parameter state
(exp_avg/exp_avg_sq/master_param) before the p.grad-is-None check, so
frozen parameters (e.g. the visual encoder in text-only training) still
get optimizer state entries. DCP then saves that state, and the next
resume fails inside gather_object with a misleading "cannot pickle code
objects" error (DCP's _wrap_exception captures the real "Size mismatch"
ValueError, whose traceback contains a CodeType).

Pass only requires_grad=True parameters to the optimizer so the frozen
visual subtree never enters optimizer state in the first place. This
also matches the standard PyTorch idiom and works regardless of which
optimizer backend the recipe selects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
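
A minimal sketch of the fix this commit describes, assuming a generic optimizer-construction point; `build_policy_optimizer` and `optimizer_cls` are illustrative names, not the repo's API:

```python
import torch

def build_policy_optimizer(model: torch.nn.Module, optimizer_cls, **optimizer_kwargs):
    # Hand the optimizer only trainable parameters so frozen modules
    # (e.g. a visual encoder during text-only training) never acquire
    # exp_avg / exp_avg_sq / master_param state entries.
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    return optimizer_cls(trainable_params, **optimizer_kwargs)

# Works with torch.optim.AdamW and TE FusedAdam alike, since both take a
# parameter iterable as their first argument.
```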