[zero] refactor low level zero for shard evenly#4030

Merged
ver217 merged 18 commits into hpcaitech:feature/zero from Gy-Lu:llzero
Jun 30, 2023

Conversation

@Gy-Lu
Contributor

@Gy-Lu Gy-Lu commented Jun 18, 2023

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

#3954

📝 What does this PR do?


This PR refactors low-level ZeRO so that parameters are sharded evenly across devices, for load balancing.
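The even-sharding idea can be illustrated with a toy sketch (pure Python, hypothetical helper, not the actual ColossalAI code): flatten the parameters into one buffer, pad it so its length divides the world size, and give each rank an equal slice.

```python
def shard_evenly(flat_params, world_size):
    """Hypothetical illustration of even sharding: pad the flattened
    parameter buffer to a multiple of world_size, then split it into
    equal per-rank shards. Returns (shards, pad_count)."""
    pad = (-len(flat_params)) % world_size  # elements needed to divide evenly
    padded = list(flat_params) + [0.0] * pad
    shard_size = len(padded) // world_size
    shards = [padded[r * shard_size:(r + 1) * shard_size]
              for r in range(world_size)]
    return shards, pad

# Example: 5 parameter elements over 4 ranks -> pad with 3 zeros,
# so every rank holds exactly 2 elements.
shards, pad = shard_evenly([1.0, 2.0, 3.0, 4.0, 5.0], world_size=4)
```

Padding is what lets every rank do the same amount of optimizer work, which is the load-balancing goal stated above; the real implementation operates on PyTorch tensors and records the padding (see the "fix pad recording in bucket store" commit) so it can be stripped when gathering.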

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@Gy-Lu Gy-Lu changed the title [refactor/zero] refactor low level zero for shard evenly [zero] refactor low level zero for shard evenly Jun 26, 2023
@Gy-Lu Gy-Lu marked this pull request as ready for review June 28, 2023 06:40
@Gy-Lu
Contributor Author

Gy-Lu commented Jun 28, 2023

The design is in #3954
And code polish is still on its way :<

@Gy-Lu Gy-Lu added enhancement New feature or request and removed enhancement New feature or request labels Jun 28, 2023
@Gy-Lu Gy-Lu self-assigned this Jun 28, 2023
@kurisusnowdeng
Contributor

> The design is in #3954 And code polish is still on its way :<

This version is distributing each master parameter across devices instead of distributing the param list. Correct?

Just be aware of tensor precision during communication in terms of efficiency.

@Gy-Lu
Contributor Author

Gy-Lu commented Jun 28, 2023

> The design is in #3954 And code polish is still on its way :<
>
> This version is distributing each master parameter across devices instead of distributing the param list. Correct?
>
> Just be aware of tensor precision during communication in terms of efficiency.

Right.
Currently the communication dtype defaults to the gradients' dtype. It seems better to add an argument for it.
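The option being discussed can be sketched in a toy form (pure Python, hypothetical names, not the ColossalAI API): let the caller pass an optional cast that is applied to gradients before the reduce, standing in for a communication dtype such as fp16 that trades precision for bandwidth.

```python
def reduce_gradients(grads_per_rank, comm_cast=None):
    """Hypothetical sketch of an all-reduce (sum) with an optional
    cast applied before communication. comm_cast stands in for a
    dtype conversion such as fp32 -> fp16."""
    cast = comm_cast if comm_cast is not None else (lambda g: g)
    casted = [[cast(g) for g in grads] for grads in grads_per_rank]
    # Element-wise sum across ranks, mimicking dist.all_reduce.
    return [sum(col) for col in zip(*casted)]

# Without a cast, the reduce uses the gradients' own dtype, which is
# the current default behaviour described in the comment above.
summed = reduce_gradients([[1.0, 2.0], [3.0, 4.0]])
```

Passing a lossy `comm_cast` would show the precision concern raised in the review: values are rounded before the sum, so the reduced result can differ from a full-precision reduce.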

@github-actions
Contributor

The code coverage for the changed files is 83%.

Click me to view the complete report
Name                                                       Stmts   Miss  Cover
------------------------------------------------------------------------------
colossalai/zero/low_level/_utils.py                          125     66    47%
colossalai/zero/low_level/bookkeeping/bucket_store.py         51      0   100%
colossalai/zero/low_level/bookkeeping/gradient_store.py       31      1    97%
colossalai/zero/low_level/bookkeeping/parameter_store.py      16      0   100%
colossalai/zero/low_level/low_level_optim.py                 285     20    93%
tests/test_zero/test_low_level/test_grad_acc.py               90     28    69%
tests/test_zero/test_low_level/test_zero1_2.py               102      1    99%
------------------------------------------------------------------------------
TOTAL                                                        700    116    83%

@github-actions
Contributor

The code coverage for the changed files is 75%.

Click me to view the complete report
Name                                                           Stmts   Miss  Cover
----------------------------------------------------------------------------------
colossalai/zero/low_level/_utils.py                              125     69    45%
colossalai/zero/low_level/bookkeeping/bucket_store.py             51      4    92%
colossalai/zero/low_level/bookkeeping/gradient_store.py           31      2    94%
colossalai/zero/low_level/bookkeeping/parameter_store.py          16      1    94%
colossalai/zero/low_level/low_level_optim.py                     285     59    79%
tests/test_booster/test_plugin/test_low_level_zero_plugin.py      59      6    90%
----------------------------------------------------------------------------------
TOTAL                                                            567    141    75%

Comment thread colossalai/zero/low_level/bookkeeping/bucket_store.py Outdated
Comment thread colossalai/zero/low_level/low_level_optim.py
@github-actions
Contributor

The code coverage for the changed files is 84%.

Click me to view the complete report
Name                                                           Stmts   Miss  Cover
----------------------------------------------------------------------------------
colossalai/zero/low_level/_utils.py                              125     66    47%
colossalai/zero/low_level/bookkeeping/bucket_store.py             51      0   100%
colossalai/zero/low_level/bookkeeping/gradient_store.py           31      1    97%
colossalai/zero/low_level/bookkeeping/parameter_store.py          16      0   100%
colossalai/zero/low_level/low_level_optim.py                     285     21    93%
tests/test_booster/test_plugin/test_low_level_zero_plugin.py      59      6    90%
tests/test_zero/test_low_level/test_grad_acc.py                   90     28    69%
tests/test_zero/test_low_level/test_zero1_2.py                   102      1    99%
----------------------------------------------------------------------------------
TOTAL                                                            759    123    84%

@ver217 ver217 merged commit 4a82c4e into hpcaitech:feature/zero Jun 30, 2023
@Gy-Lu Gy-Lu deleted the llzero branch June 30, 2023 09:30
ver217 pushed a commit to ver217/ColossalAI that referenced this pull request Jul 13, 2023
* refactor low level zero

* fix zero2 and support cpu offload

* avg gradient and modify unit test

* refactor grad store, support layer drop

* refactor bucket store, support grad accumulation

* fix and update unit test of zero and ddp

* compatible with tp, ga and unit test

* fix memory leak and polish

* add zero layer drop unittest

* polish code

* fix import err in unit test

* support different comm dtype, modify docstring style

* polish code

* test padding and fix

* fix unit test of low level zero

* fix pad recording in bucket store

* support some models

* polish
ver217 pushed a commit that referenced this pull request Jul 31, 2023
(same commit message as above)


Projects

Status: ✅ Done


3 participants