[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. #3173
ver217 merged 8 commits into hpcaitech:main from
Conversation
Hi @yhna940 Thanks for your contribution. But the way you named them is a little confusing. For each
The code coverage for the changed files is 85%. Click me to view the complete report |
Thank you for your feedback. I understand the concern about the naming conventions and appreciate your suggestion. However, I would like to propose alternative terms, `working_param` and `master_param`, which are more closely related to the mixed-precision training context. I hope this explanation clarifies my reasoning. To summarize my suggestions: `fp16` -> `working`, `fp32` -> `master`.
Hi @1SAA Based on our previous discussions, I have renamed the variables that were causing confusion related to Mixed Precision. Could you please review them once again? Thank you!
Hi @yhna940 Thanks for your contribution, but there are some conflicts in this PR. Could you please solve them first? Thanks.
Hello @binmakeswell, the conflict has been resolved. Thank you!
Hi @yhna940 I am willing to merge your PR once the CI tests pass.
Hi @1SAA GitHub Actions CI Logs / Local Test Log
Hi @yhna940 It seems there is a memory leak in our tests. I think this problem is not caused by your code. I will fix this error soon.
@yhna940 Can you sync the updates from our main branch first? We've already fixed the CI test, so after that we can merge this PR. Thanks.
@ver217 I have synced the main branch updates to this PR. Thanks for letting me know that the CI test issue has been resolved.
The code coverage for the changed files is 86%. Click me to view the complete report |
…O optimizer (#183) ## Title - [zero] Suggests a minor change to confusing variable names in the ZeRO optimizer ## Description It seems that the variable names related to the mixed precision parameter group do not comprehensively cover its characteristics, so I suggest a few changes. These changes are very trivial, but hopefully they will alleviate some of the confusion for beginners like me. Currently, the entire parameter group is named `fp16_param_groups`, and the parts managed by the GPU at the current rank are described as `fp32_flat_param_groups_of_current_rank`. This state perfectly represents the characteristics when the master weight is a half-tensor or the dtype specified in the `__init__` method is fp16. In other cases, however, its characteristics do not correspond to the variable names. I would like to propose alternative terms, `working_param` and `master_param`, which are more closely related to the concept of mixed-precision training. Using `working_param` and `master_param` would create a clear distinction between the two types of parameters and help avoid confusion. To summarize my suggestions: - `fp16` -> `working` - `fp32` -> `master` ## Linked Issues - N/A ## Reference - hpcaitech/ColossalAI#3173
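For context on what the proposed names mean, here is a minimal sketch of the working/master pattern in mixed-precision training. The names and structure are illustrative assumptions, not ColossalAI's actual implementation:

```python
import torch

# Minimal sketch of the working/master parameter pattern (hypothetical names).

# Working params: low-precision copies used for the forward/backward pass.
working_params = [torch.randn(4, 4, dtype=torch.float16, requires_grad=True)]

# Master params: full-precision copies that the optimizer actually updates.
master_params = [p.detach().clone().float() for p in working_params]
optimizer = torch.optim.SGD(master_params, lr=0.1)

# One training step: forward/backward runs in low precision.
loss = (working_params[0] * working_params[0]).sum()
loss.backward()

for working, master in zip(working_params, master_params):
    master.grad = working.grad.float()  # promote grads to full precision
optimizer.step()                        # update the master copy
optimizer.zero_grad()

for working, master in zip(working_params, master_params):
    working.data.copy_(master.data)     # write updated weights back
```

Note how neither tensor's role is tied to a fixed dtype: the working copy could just as well be bf16, which is the ambiguity the rename addresses.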
📌 Checklist before creating the PR
[doc/gemini/tensor/...]: A concise description
🚨 Issue number
N/A
📝 What does this PR do?
It seems that the variable names related to the mixed precision parameter group do not comprehensively cover its characteristics, so I suggest a few changes. These changes are very trivial, but hopefully they will alleviate some of the confusion for beginners like me.
Currently, the entire parameter group is named `fp16_param_groups`, and the parts managed by the GPU at the current rank are described as `fp32_flat_param_groups_of_current_rank`. This state perfectly represents the characteristics when the master weight is a half-tensor or the dtype specified in the `__init__` method is fp16. In other cases, however, its characteristics do not correspond to the variable names. So I would like them to be renamed according to the sharding state rather than the data type, following the `fsdp` convention of `pytorch` (with names like `flatten_sharded_optim_state_dict` and `full_optim_state_dict`). This is a related but even more trivial point: it seems that the `param_store` methods don't even need to specify fp16. Thank you :)
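To illustrate what a name like `fp32_flat_param_groups_of_current_rank` is describing, here is a sketch of ZeRO-style sharding, where each rank keeps only its shard of a flattened master copy. The values and names are hypothetical, not ColossalAI's actual code:

```python
import torch

# Hypothetical values; normally these come from torch.distributed.
world_size, rank = 4, 1

# Working params in low precision (here bf16, to show the dtype can vary).
working_params = [torch.randn(8, dtype=torch.bfloat16),
                  torch.randn(8, dtype=torch.bfloat16)]

# Flatten the working params into one contiguous full-precision buffer...
flat_master = torch.cat([p.flatten().float() for p in working_params])

# ...and keep only this rank's shard of it.
shard_size = flat_master.numel() // world_size
master_shard = flat_master[rank * shard_size:(rank + 1) * shard_size]

# With bf16 working params (or a half-precision master copy), neither "fp16"
# nor "fp32" describes these tensors accurately; "working" and "master" do.
print(master_shard.shape)  # torch.Size([4])
```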
💥 Checklist before requesting a review
⭐️ Do you enjoy contributing to Colossal-AI?
Tell us more if you don't enjoy contributing to Colossal-AI.