Currently, FSDP2 runs well for non-MoE models in general, but not for MoE models (e.g., Qwen3-30B-A3B, DeepSeek-V2-Lite):

- For Qwen3-30B-A3B, it is noticeably slower than Qwen3-32B, especially during the refit process or when using the hf-tp-plan with dtensor tp > 1.

- For DeepSeek-V2-Lite, it fails with the following error on `model.layers.0.self_attn.rotary_emb.cos_cached`, where `v.shape=torch.Size([2048, 64])` but `self.reference_model_buffers[k].shape=torch.Size([163840, 64])`:
```
  File "/workspace/nemo_rl/models/policy/dtensor_policy_worker.py", line 649, in get_reference_policy_logprobs
    with self.use_reference_model():
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yukih/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/workspacenemo_rl/models/policy/dtensor_policy_worker.py", line 626, in use_reference_model
    val.copy_(self.reference_model_buffers[k])
RuntimeError: The size of tensor a (2048) must match the size of tensor b (163840) at non-singleton dimension 0
```
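The crash comes from an unconditional `val.copy_(...)` in `use_reference_model()`: the snapshot of reference buffers and the live model buffers disagree in shape for the rotary-embedding cache, which DeepSeek-V2-Lite sizes differently (e.g. a lazily resized `cos_cached`). Below is a minimal sketch of the copy pattern with a shape guard; the function name and the skip-on-mismatch behavior are assumptions for illustration, not the actual fix in `dtensor_policy_worker.py`:

```python
import torch

def restore_buffers(model_buffers, reference_buffers):
    """Copy saved reference buffers back into live model buffers.

    Mirrors the val.copy_(reference_model_buffers[k]) pattern from
    use_reference_model(). Buffers whose shape changed since the snapshot
    would raise the RuntimeError above, so we collect them instead of
    crashing. (Skipping is an assumption; the real fix may need to
    re-snapshot or resize such buffers.)
    """
    mismatched = []
    for k, val in model_buffers.items():
        ref = reference_buffers[k]
        if val.shape != ref.shape:
            mismatched.append((k, tuple(val.shape), tuple(ref.shape)))
            continue
        val.copy_(ref)
    return mismatched

# Minimal reproduction of the reported shapes:
live = {"rotary_emb.cos_cached": torch.zeros(2048, 64)}
saved = {"rotary_emb.cos_cached": torch.ones(163840, 64)}
bad = restore_buffers(live, saved)
print(bad)
```

With the reported shapes this yields one mismatched entry for `rotary_emb.cos_cached` with `(2048, 64)` vs `(163840, 64)` instead of a `RuntimeError`.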