
[checkpointio] gather tensor before unpad it if the tensor is both padded and distributed #6168

Merged
ver217 merged 1 commit into hpcaitech:main from Lemon-412:main on Jan 21, 2025

@Lemon-412
Contributor

Lemon-412 commented Dec 24, 2024

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

fixed #6167

📝 What does this PR do?

To prevent #6167, gather the tensor before unpadding it when the tensor is both padded and distributed.
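
Why the order matters can be shown with a toy model. The sketch below is not the Colossal-AI implementation: plain Python lists stand in for tensors, and `pad_rows`, `shard_rows`, `gather_rows`, and `unpad_rows` are hypothetical helpers introduced only for illustration. It assumes the vocabulary dimension is padded up to a multiple of `tp_size` and then split evenly across ranks, which is one plausible way the size mismatch in #6167 can arise when `vocab_size % tp_size != 0`.

```python
# Toy illustration only: lists stand in for tensors, and every helper
# below is hypothetical (none of these are Colossal-AI APIs).

def pad_rows(rows, multiple):
    """Append zero rows until the row count is divisible by `multiple`."""
    width = len(rows[0])
    padded = list(rows)
    while len(padded) % multiple != 0:
        padded.append([0] * width)
    return padded

def shard_rows(rows, tp_size):
    """Split rows evenly into `tp_size` contiguous shards, one per rank."""
    per_rank = len(rows) // tp_size
    return [rows[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]

def gather_rows(shards):
    """Concatenate the per-rank shards back into one list of rows."""
    return [row for shard in shards for row in shard]

def unpad_rows(rows, origin_len):
    """Keep only the first `origin_len` rows."""
    return rows[:origin_len]

vocab_size, hidden, tp_size = 10, 2, 4           # 10 % 4 != 0, so padding is needed
weight = [[i + 1] * hidden for i in range(vocab_size)]
shards = shard_rows(pad_rows(weight, tp_size), tp_size)  # 12 rows -> 4 shards of 3

# Unpad first: each shard has only 3 rows, so slicing to 10 rows is a no-op
# and the padding survives the gather.
wrong = gather_rows([unpad_rows(s, vocab_size) for s in shards])

# Gather first: the full 12-row tensor is rebuilt, then trimmed to 10 rows.
right = unpad_rows(gather_rows(shards), vocab_size)

print(len(wrong), len(right))  # 12 10
```

In this toy setup, unpadding a shard cannot remove anything because the shard is already shorter than the original length, so the saved tensor keeps its padded size. Gathering first restores the global shape that the unpad step expects.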

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

Lemon-412 requested a review from a team as a code owner on December 24, 2024, 09:50
Lemon-412 changed the title from "[checkpoint_io]" to "[checkpoint_io] gather tensor before unpad it if the tensor is both padded and distributed" on Dec 24, 2024
@Lemon-412
Contributor Author

Requesting review from @flybird11111 and @ver217.

Lemon-412 changed the title from "[checkpoint_io] gather tensor before unpad it if the tensor is both padded and distributed" to "[checkpointio] gather tensor before unpad it if the tensor is both padded and distributed" on Jan 20, 2025
@Lemon-412
Contributor Author

It seems that one unit test (test_dist_lamb) fails with OOM.
Weird, should we rerun the test?

@ver217
Contributor

ver217 commented Jan 20, 2025

> It seems that one unit test (test_dist_lamb) fails with OOM. Weird, should we rerun the test?

This was fixed in the main branch. Could you rebase onto the main branch?

@Lemon-412
Contributor Author

> > It seems that one unit test (test_dist_lamb) fails with OOM. Weird, should we rerun the test?
>
> This was fixed in the main branch. Could you rebase onto the main branch?

Done. Approval is required again because the rebase required a force push.

ver217 merged commit 97e60cb into hpcaitech:main on Jan 21, 2025
Development

Successfully merging this pull request may close these issues.

[BUG]: Size Mismatch Issue When Loading Model Checkpoints Trained with Tensor Parallel if vocab_size % tp_size != 0
