properly save model across tensor parallel processes#39700

Closed
winglian wants to merge 2 commits into huggingface:main from winglian:tp-size-save

Conversation

Collaborator

@winglian winglian commented Jul 26, 2025

What does this PR do?

Saving a model under Tensor Parallelism currently doesn't work: the Trainer defaults to checking `args.should_save`, which triggers a save on rank 0 only, but gathering the DTensor shards is a collective operation that must be coordinated across all TP ranks.

This is still a draft, as I also need to handle the combined FSDP + TP case.

@S1ro1

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker ArthurZucker added the for patch Tag issues / labels that should be included in the next patch label Jul 28, 2025
@S1ro1
Contributor

S1ro1 commented Jul 28, 2025

Superseded by #39693

@S1ro1 S1ro1 removed the for patch Tag issues / labels that should be included in the next patch label Jul 28, 2025
@winglian winglian closed this Jul 28, 2025