[zero] refactor low level zero for shard evenly#4030

Merged
ver217 merged 18 commits into hpcaitech:feature/zero from Gy-Lu:llzero
Jun 30, 2023

Conversation

@Gy-Lu
Contributor

@Gy-Lu Gy-Lu commented Jun 18, 2023

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

#3954

📝 What does this PR do?


This PR refactors low-level ZeRO so that parameters are sharded evenly across devices, for load balancing.
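The even-sharding idea can be illustrated with a toy sketch (pure Python, hypothetical helper, not the actual ColossalAI code): flatten the parameters into one buffer, pad it so its length divides the world size, and give each rank an equal slice.

```python
def shard_evenly(flat_params, world_size):
    """Hypothetical illustration of even sharding: pad the flattened
    parameter buffer to a multiple of world_size, then split it into
    equal per-rank shards. Returns (shards, pad_count)."""
    pad = (-len(flat_params)) % world_size  # elements needed to divide evenly
    padded = list(flat_params) + [0.0] * pad
    shard_size = len(padded) // world_size
    shards = [padded[r * shard_size:(r + 1) * shard_size]
              for r in range(world_size)]
    return shards, pad

# Example: 5 parameter elements over 4 ranks -> pad with 3 zeros,
# so every rank holds exactly 2 elements.
shards, pad = shard_evenly([1.0, 2.0, 3.0, 4.0, 5.0], world_size=4)
```

Padding is what lets every rank do the same amount of optimizer work, which is the load-balancing goal stated above; the real implementation operates on PyTorch tensors and records the padding (see the "fix pad recording in bucket store" commit) so it can be stripped when gathering.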

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@Gy-Lu Gy-Lu changed the title [refactor/zero] refactor low level zero for shard evenly [zero] refactor low level zero for shard evenly Jun 26, 2023
@Gy-Lu Gy-Lu marked this pull request as ready for review June 28, 2023 06:40
@Gy-Lu
Contributor Author

Gy-Lu commented Jun 28, 2023

The design is in #3954
And code polish is still on its way :<

@Gy-Lu Gy-Lu added enhancement New feature or request and removed enhancement New feature or request labels Jun 28, 2023
@Gy-Lu Gy-Lu self-assigned this Jun 28, 2023
@kurisusnowdeng
Contributor

> The design is in #3954 And code polish is still on its way :<

This version is distributing each master parameter across devices instead of distributing the param list. Correct?

Just be aware of tensor precision during communication in terms of efficiency.

@Gy-Lu
Contributor Author

Gy-Lu commented Jun 28, 2023

> The design is in #3954 And code polish is still on its way :<
>
> This version is distributing each master parameter across devices instead of distributing the param list. Correct?
>
> Just be aware of tensor precision during communication in terms of efficiency.

Right.
Currently the communication dtype defaults to the gradients' dtype. It seems better to add an argument for it.
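The option being discussed can be sketched in a toy form (pure Python, hypothetical names, not the ColossalAI API): let the caller pass an optional cast that is applied to gradients before the reduce, standing in for a communication dtype such as fp16 that trades precision for bandwidth.

```python
def reduce_gradients(grads_per_rank, comm_cast=None):
    """Hypothetical sketch of an all-reduce (sum) with an optional
    cast applied before communication. comm_cast stands in for a
    dtype conversion such as fp32 -> fp16."""
    cast = comm_cast if comm_cast is not None else (lambda g: g)
    casted = [[cast(g) for g in grads] for grads in grads_per_rank]
    # Element-wise sum across ranks, mimicking dist.all_reduce.
    return [sum(col) for col in zip(*casted)]

# Without a cast, the reduce uses the gradients' own dtype, which is
# the current default behaviour described in the comment above.
summed = reduce_gradients([[1.0, 2.0], [3.0, 4.0]])
```

Passing a lossy `comm_cast` would show the precision concern raised in the review: values are rounded before the sum, so the reduced result can differ from a full-precision reduce.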

@github-actions
Contributor

The code coverage for the changed files is 83%.

Click me to view the complete report
Name                                                       Stmts   Miss  Cover
------------------------------------------------------------------------------
colossalai/zero/low_level/_utils.py                          125     66    47%
colossalai/zero/low_level/bookkeeping/bucket_store.py         51      0   100%
colossalai/zero/low_level/bookkeeping/gradient_store.py       31      1    97%
colossalai/zero/low_level/bookkeeping/parameter_store.py      16      0   100%
colossalai/zero/low_level/low_level_optim.py                 285     20    93%
tests/test_zero/test_low_level/test_grad_acc.py               90     28    69%
tests/test_zero/test_low_level/test_zero1_2.py               102      1    99%
------------------------------------------------------------------------------
TOTAL                                                        700    116    83%

@github-actions
Contributor

The code coverage for the changed files is 75%.

Click me to view the complete report
Name                                                           Stmts   Miss  Cover
----------------------------------------------------------------------------------
colossalai/zero/low_level/_utils.py                              125     69    45%
colossalai/zero/low_level/bookkeeping/bucket_store.py             51      4    92%
colossalai/zero/low_level/bookkeeping/gradient_store.py           31      2    94%
colossalai/zero/low_level/bookkeeping/parameter_store.py          16      1    94%
colossalai/zero/low_level/low_level_optim.py                     285     59    79%
tests/test_booster/test_plugin/test_low_level_zero_plugin.py      59      6    90%
----------------------------------------------------------------------------------
TOTAL                                                            567    141    75%

Comment thread colossalai/zero/low_level/bookkeeping/bucket_store.py Outdated
Comment thread colossalai/zero/low_level/low_level_optim.py
@github-actions
Contributor

The code coverage for the changed files is 84%.

Click me to view the complete report
Name                                                           Stmts   Miss  Cover
----------------------------------------------------------------------------------
colossalai/zero/low_level/_utils.py                              125     66    47%
colossalai/zero/low_level/bookkeeping/bucket_store.py             51      0   100%
colossalai/zero/low_level/bookkeeping/gradient_store.py           31      1    97%
colossalai/zero/low_level/bookkeeping/parameter_store.py          16      0   100%
colossalai/zero/low_level/low_level_optim.py                     285     21    93%
tests/test_booster/test_plugin/test_low_level_zero_plugin.py      59      6    90%
tests/test_zero/test_low_level/test_grad_acc.py                   90     28    69%
tests/test_zero/test_low_level/test_zero1_2.py                   102      1    99%
----------------------------------------------------------------------------------
TOTAL                                                            759    123    84%

@ver217 ver217 merged commit 4a82c4e into hpcaitech:feature/zero Jun 30, 2023
@Gy-Lu Gy-Lu deleted the llzero branch June 30, 2023 09:30
ver217 pushed a commit to ver217/ColossalAI that referenced this pull request Jul 13, 2023
* refactor low level zero

* fix zero2 and support cpu offload

* avg gradient and modify unit test

* refactor grad store, support layer drop

* refactor bucket store, support grad accumulation

* fix and update unit test of zero and ddp

* compatible with tp, ga and unit test

* fix memory leak and polish

* add zero layer drop unittest

* polish code

* fix import err in unit test

* support different comm dtype, modify docstring style

* polish code

* test padding and fix

* fix unit test of low level zero

* fix pad recording in bucket store

* support some models

* polish
ver217 pushed a commit that referenced this pull request Jul 31, 2023
(same commit message as above)


Projects

Status: ✅ Done


3 participants