Fix GPT-OSS TP IndexError and unwrapping DTensor #42356
akshan-main wants to merge 5 commits into huggingface:main
Conversation
[For maintainers] Suggested jobs to run (before merge): run-slow: gpt_oss
Hey @3outeille @ArthurZucker! It'd be great if this PR could be reviewed; let me know if there is anything else you'd like me to integrate.
sinks = module.sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
sinks = module.sinks
if type(sinks).__name__ == "DTensor":
We would like to not have DTensor logic in the modeling. For example, sinks is supposed to use local_rowwise (cf. https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_oss/configuration_gpt_oss.py#L41), which is supposed to not return a DTensor (cf. https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/tensor_parallel.py#L1171), but somehow that doesn't work.
I think the cleanest way to handle this without modifying the HF modeling would be to understand why sinks is still a DTensor after local_rowwise.
Hi @akshan-main, @3outeille, attn_weights should also be a DTensor, right, if the model is prepared for tp_auto? When accelerate uses the _prepare_tp function, it first prepares the model by converting all the model parameters to DTensor.
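For reference, the duck-typed unwrap discussed in this thread can be sketched in isolation. This is a minimal, hypothetical sketch: the `DTensor` class below is a stand-in for `torch.distributed.tensor.DTensor` (whose real `to_local()` returns the shard local to the current rank), used only so the snippet runs without a distributed setup.

```python
def maybe_unwrap_dtensor(t):
    """Return the local shard if t is a DTensor, otherwise return t unchanged.

    The name check mirrors the approach in this PR, which avoids importing
    torch.distributed.tensor inside the modeling code.
    """
    if type(t).__name__ == "DTensor":
        return t.to_local()
    return t


class DTensor:
    """Stand-in for torch.distributed.tensor.DTensor, for illustration only."""

    def __init__(self, local):
        self._local = local

    def to_local(self):
        # The real DTensor.to_local() returns this rank's local tensor shard.
        return self._local


print(maybe_unwrap_dtensor(DTensor([1.0, 2.0])))  # [1.0, 2.0]
print(maybe_unwrap_dtensor([3.0]))                # [3.0] (plain values pass through)
```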
sinks = module.sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
sinks = module.sinks
if type(sinks).__name__ == "DTensor":
Same comment as above about DTensor and local_rowwise.
Sorry, context switching is so hard that I forgot to click on
haha understandable! I will work on trying to fix this
Hi @akshan-main, I am also seeing DTensor type for
@akshan-main @quic-akuruvil encountered the issue again and had to fix it: #42906
[GPT-OSS] Fix Tensor Parallelism IndexError and DTensor casting

What does this PR do?

This PR fixes two specific issues preventing GPT-OSS models from training with Tensor Parallelism (TP) and FSDP, as reported in #41819.

The changes are:

1. Fix `IndexError` in TP hooks: `tensor_parallel` hooks in transformers expect the input tensor (hidden states) to be passed as the first positional argument (`args[0]`). `GptOssDecoderLayer` was previously passing `hidden_states` as a keyword argument, causing the hook to fail with `IndexError: tuple index out of range`. This PR updates `GptOssDecoderLayer.forward` to pass `hidden_states` as the first positional argument.

2. Fix DTensor casting in eager attention: `module.sinks` is wrapped as a `DTensor`. The `eager_attention_forward` function attempts to `torch.cat` it with `attn_weights` (a local tensor), causing a crash. This PR checks whether `sinks` is a `DTensor` and unwraps it before concatenation.

Status: I have applied these changes to `modular_gpt_oss.py` and ran `make fix-copies`.

Fixes #41819
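To illustrate the first fix, here is a minimal, framework-free sketch of the failure mode. The hook and dispatcher below are hypothetical stand-ins (not the actual transformers hook): they only mirror the fact that TP pre-forward hooks index the positional argument tuple.

```python
def tp_pre_forward_hook(module, args, kwargs):
    """Stand-in for a TP pre-forward hook: like the real hook, it reads
    the hidden states from the first positional argument."""
    hidden_states = args[0]  # raises IndexError if the input came in by keyword
    return hidden_states


def call_layer(hook, *args, **kwargs):
    """Simulate dispatching a layer call through the pre-forward hook."""
    try:
        return hook(None, args, kwargs)
    except IndexError:
        return "IndexError: tuple index out of range"


# Before the fix: hidden_states passed as a keyword argument, so args == ().
print(call_layer(tp_pre_forward_hook, hidden_states=[1.0]))  # IndexError: tuple index out of range

# After the fix: hidden_states passed positionally, so args == ([1.0],).
print(call_layer(tp_pre_forward_hook, [1.0]))  # [1.0]
```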
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. (Linked above)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@3outeille
@ArthurZucker