[checkpointio]: General Checkpointing of Sharded Optimizers #3961

@Fridge003

Description

Describe the feature

According to the checkpoint system discussed in #3399, saving/loading of sharded optimizers should be supported, since optimizer states can be very large.

As for the solution, the approach in @ver217's latest response under #3399 looks feasible. An optimizer's state dict consists of two parts: a dict of per-parameter state and a dict of param_groups:
[screenshot: an optimizer state dict showing its `state` and `param_groups` keys]
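As a concrete illustration, here is a hand-written sketch of that shape (it mirrors what `torch.optim.Adam.state_dict()` returns; tensor values are replaced with plain lists so the example stays dependency-free):

```python
# Hypothetical example of an optimizer state_dict's two parts.
state_dict = {
    # 'state': per-parameter optimizer states, keyed by parameter index.
    # This part can be several times the model size, so it needs sharding.
    "state": {
        0: {"step": 10, "exp_avg": [0.1, 0.2], "exp_avg_sq": [0.01, 0.04]},
        1: {"step": 10, "exp_avg": [0.3], "exp_avg_sq": [0.09]},
    },
    # 'param_groups': hyperparameters plus the indices of the parameters
    # they govern. Small, so a single group file suffices.
    "param_groups": [
        {"lr": 1e-3, "betas": (0.9, 0.999), "weight_decay": 0.0, "params": [0, 1]},
    ],
}
print(sorted(state_dict.keys()))  # ['param_groups', 'state']
```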

So under the checkpoint folder for a sharded optimizer, the files might be categorized into three classes:

  • An index file recording weight maps and metadata.
  • A group file storing information of param_groups. Memory usage of param_groups is small so one single file is enough.
  • State files containing optimizer states for parameters. Optimizer state might be large so it should be sharded into different files.
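A minimal sketch of how such a checkpoint folder could be written. The file names, the JSON/pickle formats, the shard-size threshold, and the `save_sharded_optimizer` helper are all illustrative assumptions for this issue, not ColossalAI's actual API or on-disk layout:

```python
import json
import os
import pickle
import tempfile

def save_sharded_optimizer(state_dict, ckpt_dir, max_shard_bytes=1024):
    """Write param_groups to one small group file, split 'state' across
    shard files, and record a weight map plus metadata in an index file."""
    os.makedirs(ckpt_dir, exist_ok=True)

    # Group file: param_groups is small, so one file is enough.
    with open(os.path.join(ckpt_dir, "param_groups.json"), "w") as f:
        json.dump(state_dict["param_groups"], f)

    index = {"metadata": {}, "weight_map": {}}
    shard, shard_id, shard_size = {}, 0, 0

    def flush_shard():
        nonlocal shard, shard_id, shard_size
        if not shard:
            return
        name = f"optim_states_{shard_id:05d}.pkl"
        with open(os.path.join(ckpt_dir, name), "wb") as f:
            pickle.dump(shard, f)
        for pid in shard:  # record which shard holds each parameter's state
            index["weight_map"][str(pid)] = name
        shard, shard_id, shard_size = {}, shard_id + 1, 0

    # State files: start a new shard whenever the current one would
    # exceed the size budget.
    for pid, pstate in state_dict["state"].items():
        entry_size = len(pickle.dumps(pstate))
        if shard and shard_size + entry_size > max_shard_bytes:
            flush_shard()
        shard[pid] = pstate
        shard_size += entry_size
    flush_shard()

    # Index file: weight map plus metadata.
    index["metadata"]["num_shards"] = shard_id
    with open(os.path.join(ckpt_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index

# Usage with a fake (tensor-free) optimizer state dict:
fake_state = {
    "state": {i: {"step": 1, "exp_avg": [0.0] * 64} for i in range(4)},
    "param_groups": [{"lr": 1e-3, "params": list(range(4))}],
}
ckpt = tempfile.mkdtemp()
idx = save_sharded_optimizer(fake_state, ckpt, max_shard_bytes=400)
print(sorted(os.listdir(ckpt)))
```

Loading would follow the reverse path: read the index file, fetch only the shard files that hold the needed parameters, and merge them with the group file into a full state dict.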

This design suffices for general use. Handling more complex scenarios such as pipeline parallelism or auto parallelism will require further design work.

For more details on this design, please refer to @ver217's response under #3399.

Labels: enhancement (New feature or request)
