### Describe the feature
According to the checkpoint system discussed in #3399, saving/loading of sharded optimizers should be supported, since an optimizer's state can be huge.
As for the solution, @ver217 's latest response under #3399 seems feasible. A state dict of an optimizer includes two parts: a dict of `state` and a dict of `param_groups`:
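As a concrete illustration, a PyTorch-style optimizer state dict has roughly the following shape (the Adam state keys, tensor placeholders, and hyperparameter values below are illustrative, not taken from the actual ColossalAI implementation):

```python
# Illustrative layout of what optimizer.state_dict() returns for an
# Adam-style optimizer. Values here are placeholders, not real tensors.
state_dict = {
    # "state": per-parameter optimizer state, keyed by parameter index.
    # For Adam this holds "step", "exp_avg", and "exp_avg_sq" tensors,
    # so these entries can be as large as the parameters themselves.
    "state": {
        0: {"step": 10, "exp_avg": "<tensor>", "exp_avg_sq": "<tensor>"},
        1: {"step": 10, "exp_avg": "<tensor>", "exp_avg_sq": "<tensor>"},
    },
    # "param_groups": hyperparameters plus the parameter indices of each
    # group. This part is small, so a single file suffices for it.
    "param_groups": [
        {"lr": 1e-3, "betas": (0.9, 0.999), "weight_decay": 0.0,
         "params": [0, 1]},
    ],
}
```

The key size asymmetry between the two parts is what motivates the file layout below: `param_groups` stays in one file, while `state` is sharded.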

So under the checkpoint folder for a sharded optimizer, the files might be categorized into three classes:
- An index file recording weight maps and metadata.
- A group file storing the `param_groups` information. Memory usage of `param_groups` is small, so a single file is enough.
- State files containing optimizer states for parameters. Optimizer state can be large, so it should be sharded across multiple files.
This design is sufficient for general purposes. Handling more complex scenarios such as pipeline parallelism or auto parallelism will need further design work.
For more details on this design, please refer to @ver217 's response under #3399.