### Describe the feature
According to the checkpoint system discussed in #3399, saving/loading of sharded optimizers should be supported, since an optimizer's state can be huge.
As for the solution, @ver217 's latest response under #3399 seems feasible. A state dict of an optimizer includes two parts: a dict of `state` and a dict of `param_groups`:
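As a concrete illustration, a PyTorch-style optimizer state dict has roughly the following shape (the Adam state keys, tensor placeholders, and hyperparameter values below are illustrative, not taken from the actual ColossalAI implementation):

```python
# Illustrative layout of what optimizer.state_dict() returns for an
# Adam-style optimizer. Values here are placeholders, not real tensors.
state_dict = {
    # "state": per-parameter optimizer state, keyed by parameter index.
    # For Adam this holds "step", "exp_avg", and "exp_avg_sq" tensors,
    # so these entries can be as large as the parameters themselves.
    "state": {
        0: {"step": 10, "exp_avg": "<tensor>", "exp_avg_sq": "<tensor>"},
        1: {"step": 10, "exp_avg": "<tensor>", "exp_avg_sq": "<tensor>"},
    },
    # "param_groups": hyperparameters plus the parameter indices of each
    # group. This part is small, so a single file suffices for it.
    "param_groups": [
        {"lr": 1e-3, "betas": (0.9, 0.999), "weight_decay": 0.0,
         "params": [0, 1]},
    ],
}
```

The key size asymmetry between the two parts is what motivates the file layout below: `param_groups` stays in one file, while `state` is sharded.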

So under the checkpoint folder for a sharded optimizer, the files might be categorized into three classes:
- An index file recording weight maps and metadata.
- A group file storing the `param_groups` information. Memory usage of `param_groups` is small, so a single file is enough.
- State files containing optimizer states for parameters. Optimizer state can be large, so it should be sharded across multiple files.
This design is sufficient for general purposes. Handling more complex scenarios such as pipeline parallelism or auto parallelism will need further design work.
For more details on this design, please refer to @ver217 's response under #3399.