Currently, FSDP2 runs well for non-MoE models in general, but not for MoE models (e.g., Qwen3-30B-A3B, DeepSeek-V2-Lite):

- For Qwen3-30B-A3B, it is noticeably slower than Qwen3-32B, especially during the refit process or when using the hf-tp-plan with dtensor tp > 1.

- For DeepSeek-V2-Lite, it fails with the following error on `model.layers.0.self_attn.rotary_emb.cos_cached`, where `v.shape=torch.Size([2048, 64])` but `self.reference_model_buffers[k].shape=torch.Size([163840, 64])`:
```
  File "/workspace/nemo_rl/models/policy/dtensor_policy_worker.py", line 649, in get_reference_policy_logprobs
    with self.use_reference_model():
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yukih/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/workspacenemo_rl/models/policy/dtensor_policy_worker.py", line 626, in use_reference_model
    val.copy_(self.reference_model_buffers[k])
RuntimeError: The size of tensor a (2048) must match the size of tensor b (163840) at non-singleton dimension 0
```
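The crash comes from an unconditional `val.copy_(...)` in `use_reference_model()`: the snapshot of reference buffers and the live model buffers disagree in shape for the rotary-embedding cache, which DeepSeek-V2-Lite sizes differently (e.g. a lazily resized `cos_cached`). Below is a minimal sketch of the copy pattern with a shape guard; the function name and the skip-on-mismatch behavior are assumptions for illustration, not the actual fix in `dtensor_policy_worker.py`:

```python
import torch

def restore_buffers(model_buffers, reference_buffers):
    """Copy saved reference buffers back into live model buffers.

    Mirrors the val.copy_(reference_model_buffers[k]) pattern from
    use_reference_model(). Buffers whose shape changed since the snapshot
    would raise the RuntimeError above, so we collect them instead of
    crashing. (Skipping is an assumption; the real fix may need to
    re-snapshot or resize such buffers.)
    """
    mismatched = []
    for k, val in model_buffers.items():
        ref = reference_buffers[k]
        if val.shape != ref.shape:
            mismatched.append((k, tuple(val.shape), tuple(ref.shape)))
            continue
        val.copy_(ref)
    return mismatched

# Minimal reproduction of the reported shapes:
live = {"rotary_emb.cos_cached": torch.zeros(2048, 64)}
saved = {"rotary_emb.cos_cached": torch.ones(163840, 64)}
bad = restore_buffers(live, saved)
print(bad)
```

With the reported shapes this yields one mismatched entry for `rotary_emb.cos_cached` with `(2048, 64)` vs `(163840, 64)` instead of a `RuntimeError`.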