
Fix GPT-OSS TP IndexError and unwrapping DTensor#42356

Closed
akshan-main wants to merge 5 commits into huggingface:main from akshan-main:fix-gpt-oss-tp

Conversation

@akshan-main

@akshan-main akshan-main commented Nov 24, 2025

[GPT-OSS] Fix Tensor Parallelism IndexError and DTensor casting

What does this PR do?

This PR fixes two specific issues preventing GPT-OSS models from training with Tensor Parallelism (TP) and FSDP, as reported in #41819.

The changes are:

  1. Fix IndexError in TP Hooks:

    • Bug: The tensor_parallel hooks in transformers expect the input tensor (hidden states) to be passed as the first positional argument (args[0]). The GptOssDecoderLayer was previously passing hidden_states as a keyword argument, causing the hook to fail with IndexError: tuple index out of range.
    • Fix: Updated GptOssDecoderLayer.forward to pass hidden_states as the first positional argument.
  2. Fix DTensor Casting in Eager Attention:

    • Bug: When TP is enabled, module.sinks is wrapped as a DTensor. The eager_attention_forward function attempts to torch.cat this with attn_weights (a local tensor), causing a crash.
    • Fix: Added a check to detect if sinks is a DTensor and unwrap it before concatenation.
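The two fixes above can be sketched in isolation. This is a minimal illustration, not the actual transformers code: `tp_pre_hook`, `DTensor`, and `maybe_unwrap` here are hypothetical stand-ins for the real hook, `torch.distributed.tensor.DTensor`, and the unwrapping logic added by the PR.

```python
# Fix 1: the TP pre-forward hook indexes args[0], so hidden_states must be
# passed positionally. An empty args tuple is what raised
# "IndexError: tuple index out of range".
def tp_pre_hook(module, args, kwargs):
    hidden_states = args[0]  # fails if hidden_states arrived via kwargs only
    return args, kwargs

hidden = [1.0, 2.0]
# Broken call site: layer.forward(hidden_states=hidden)  -> args == ()
# Fixed call site:  layer.forward(hidden, ...)           -> args == (hidden,)
args, kwargs = tp_pre_hook(None, (hidden,), {})
assert args[0] is hidden

# Fix 2: unwrap a DTensor before concatenating it with a local tensor.
class DTensor:  # stand-in for torch.distributed.tensor.DTensor
    def __init__(self, local):
        self._local = local

    def to_local(self):
        return self._local

def maybe_unwrap(sinks):
    # Name-based check mirrors the PR and avoids a hard dependency on
    # torch.distributed being initialized.
    if type(sinks).__name__ == "DTensor":
        return sinks.to_local()
    return sinks

print(maybe_unwrap(DTensor([0.5])))  # [0.5]
print(maybe_unwrap([0.5]))           # [0.5]
```

With a real model, `maybe_unwrap` would sit just before the `torch.cat` of sinks with `attn_weights` in `eager_attention_forward`.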

Status:
I have applied these changes to modular_gpt_oss.py and ran `make fix-copies`.

Fixes #41819

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case. (Linked above)
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@3outeille
@ArthurZucker

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: gpt_oss

@akshan-main
Author

Hey @3outeille @ArthurZucker! It'd be great if this PR could be reviewed; let me know if you'd like me to integrate anything else.


sinks = module.sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
sinks = module.sinks
if type(sinks).__name__ == "DTensor":
Member


we would like to not have DTensor logic in the modeling. For example, sinks is supposed to use local_rowwise (cf https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_oss/configuration_gpt_oss.py#L41), which is supposed to not return a DTensor (cf https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/tensor_parallel.py#L1171), but somehow doesn't work

I think the cleanest way to handle this without modifying the HF modeling would be to understand why sinks is still a DTensor after local_rowwise
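A small diagnostic along these lines can help narrow down why a parameter survives as a DTensor after the TP plan runs: scan the model for parameters whose runtime type is still `DTensor`. This is a hedged sketch; the `DTensor` class and the `(name, param)` pairs below are stand-ins, and on a real model you would pass `model.named_parameters()` instead.

```python
class DTensor:  # stand-in for torch.distributed.tensor.DTensor
    pass

def find_dtensor_params(named_params):
    """Return the names of parameters whose runtime type is DTensor."""
    return [
        name
        for name, param in named_params
        if type(param).__name__ == "DTensor"
    ]

# Hypothetical parameter names for illustration only.
params = [
    ("model.layers.0.self_attn.sinks", DTensor()),   # still wrapped
    ("model.layers.0.mlp.router.weight", object()),  # plain local tensor
]
print(find_dtensor_params(params))  # ['model.layers.0.self_attn.sinks']
```

Running this right after the TP plan is applied would show whether only `sinks` is affected or whether other parameters (e.g. the router weights mentioned later in this thread) are wrapped too.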


Hi @akshan-main, @3outeille, attn_weights should also be a DTensor, right, if the model is prepared for tp_auto? When accelerate uses the _prepare_tp function, it first prepares the model by converting all the model parameters to DTensor.


sinks = module.sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
sinks = module.sinks
if type(sinks).__name__ == "DTensor":
Member


same statement as above about DTensor and local_rowwise

@3outeille
Member

3outeille commented Nov 28, 2025

sorry context switching so hard that I forgot to click on submit the reviews lol

@akshan-main
Author

sorry context switching so hard that I forgot to click on submit the reviews lol

haha understandable! I will work on trying to fix this

@quic-akuruvil

quic-akuruvil commented Dec 11, 2025

Hi @akshan-main, I am also seeing the DTensor type for self.weights and self.bias in the GptOssTopKRouter module, and at multiple other places.

@3outeille
Member

@akshan-main @quic-akuruvil encountered the issue again and had to fix it: #42906

@3outeille 3outeille closed this Dec 19, 2025

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IndexError: tuple index out of range when using Tensor Parallelism with FSDP2 on GPT-OSS 20B (tensor_parallel.py, line 510)

3 participants