fix DTensor and tensor mismatch #42906
Conversation
eaeaae fix tensor parallel MoE test
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
tp_layer_instance = ALL_PARALLEL_STYLES[model.tp_plan[matched_tp_pattern]]
tp_layer = tp_layer_instance.__class__
mapping.distributed_operation = tp_layer(
    device_mesh=device_mesh, rank=device_map[""].index, empty_param=empty_param.clone()
)
mapping.distributed_operation.use_dtensor = tp_layer_instance.use_dtensor
```
A nice hack, though if we can come up with a better fix let's try to avoid it please!
The kwargs should only be `device_mesh=device_mesh, rank=device_map[""].index, empty_param=empty_param.clone()` for init; the rest should not be kwargs of init, more like hardcoded for that "type".
If you see what I mean, here we should only get the class and init it -> `local_colwise` should get its stuff (see the sketch below).
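A minimal sketch of that suggestion, with hypothetical class names (not the actual `ALL_PARALLEL_STYLES` entries in transformers): the per-style `use_dtensor` flag lives on a dedicated subclass, so callers only look up the class and init it with the three kwargs.

```python
# Hypothetical sketch of "hardcode it for that type": the local variant is
# its own class, so re-instantiating never loses use_dtensor.

class ColwiseParallel:
    use_dtensor = True  # default for the regular style

    def __init__(self, device_mesh, rank, empty_param):
        self.device_mesh = device_mesh
        self.rank = rank
        self.empty_param = empty_param


class LocalColwiseParallel(ColwiseParallel):
    use_dtensor = False  # "local_colwise should get its stuff" from the type itself


ALL_PARALLEL_STYLES = {"colwise": ColwiseParallel, "local_colwise": LocalColwiseParallel}

# Only get the class and init it; no post-init attribute copying needed.
op = ALL_PARALLEL_STYLES["local_colwise"](device_mesh=None, rank=0, empty_param=None)
assert op.use_dtensor is False
```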
ArthurZucker left a comment
Looks good to me otherwise, the test will come later AFAIK with your PR on fast distributed tests
```diff
      router_indices == -1, num_local_experts
  )  # masking class for one hot
- return router_scores, router_indices
+ return router_logits, router_scores, router_indices
```
unsure this works for all models but let's see!
yeah I need to fix the Expert parallel anyway so we will see
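For context on the return-value change discussed above, here is a minimal, hypothetical routing sketch (a simplified stand-in, not the model's actual code): a top-k router that now also surfaces the raw `router_logits` alongside scores and indices.

```python
import torch
import torch.nn.functional as F


def route_tokens(hidden_states, router_weight, top_k, num_local_experts):
    # hidden_states: (num_tokens, hidden_dim); router_weight: (hidden_dim, num_local_experts)
    router_logits = hidden_states @ router_weight
    top_scores, router_indices = torch.topk(router_logits, top_k, dim=-1)
    router_scores = F.softmax(top_scores, dim=-1)
    # In an expert-parallel path, indices of non-local experts could be set to
    # -1 upstream; a one-hot dispatch mask is then built with those entries
    # masked out (the "masking class for one hot" comment in the diff above).
    dispatch_mask = F.one_hot(router_indices.clamp(min=0), num_local_experts)
    dispatch_mask = dispatch_mask * (router_indices != -1).unsqueeze(-1)
    # The change in the diff: also return the raw logits, e.g. for
    # auxiliary load-balancing losses.
    return router_logits, router_scores, router_indices
```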
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42906&sha=12ff9a
* begin Moe test tensor parallel
* create tiny moe model + fix test tensor parallel Moe eaeaae
* create tiny moe model + fix test tensor parallel Moe eaeaae fix tensor parallel MoE test fix tensor parallel MoE test
* fix backward pass test in tensor parallel for Dense model (huggingface#42811)
* fix
* linting
* use mixtral instead for testing
* fix dtensor and tensor mismatch
* linting
* checkout test tensor parallel to be like main
* avoid hack and create class instead
Bug

`local_rowise` or `local_colwise` is calling `RowwiseParallel(use_dtensor=False)` (resp. `ColwiseParallel(use_dtensor=False)`). The issue was first noticed in #42356.

Fix

Re-instantiating via `.__class__` was creating the bug, because `.__class__` will reuse the class default `use_dtensor` value, thus overriding the value we specified in `local_rowise`/`local_colwise`. The fix makes sure to properly use the `use_dtensor` value, and thus there is no more DTensor and tensor mismatch.
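A self-contained illustration of that pitfall (generic Python, not the actual transformers classes):

```python
class ColwiseParallel:
    def __init__(self, use_dtensor=True):
        self.use_dtensor = use_dtensor


# The plan stores a configured instance, e.g. for local_colwise:
local_colwise = ColwiseParallel(use_dtensor=False)

# Re-instantiating via .__class__ silently falls back to the class default:
rebuilt = local_colwise.__class__()
print(rebuilt.use_dtensor)        # True  -> the use_dtensor=False we specified is lost
print(local_colwise.use_dtensor)  # False -> hence the manual copy in the hack above
```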