[shardformer] support sharded checkpointing IO for hybrid parallel plugin #4477

@Fridge003

Support checkpoint IO for the hybrid parallel plugin. It should handle tensor parallelism (tp), pipeline parallelism (pp), and ZeRO properly. As a first step, complete the sharded case for both the model and the optimizer.

The features to be implemented are:

  • Sharded saving of model: saving model to multiple files under the same checkpoint directory.
  • Sharded loading of model: loading model from multiple files under the same checkpoint directory.
  • Sharded saving of optimizer: saving optimizer to multiple files under the same checkpoint directory.
  • Sharded loading of optimizer: loading optimizer from multiple files under the same checkpoint directory.
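
To make the four items above concrete, here is a rough, framework-free sketch of sharded saving and loading. It is not the plugin's implementation: the function names, the `index.json` file name, the shard file naming, and the use of `pickle` are all assumptions made for illustration, and the state is a plain dict rather than real tensors. The `weight_map` index layout mirrors a convention used by some sharded checkpoint formats. Handling of tp/pp/zero (deciding which rank owns which entries) is deliberately left out.

```python
import json
import pickle
from pathlib import Path

INDEX_NAME = "index.json"  # hypothetical index file name


def save_sharded(state, ckpt_dir, max_shard_bytes):
    """Split `state` into shards of roughly `max_shard_bytes` and write them
    plus an index file into `ckpt_dir`. Returns the name -> file mapping."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    # Greedily pack entries into shards; an oversized entry gets its own shard.
    shards, current, current_size = [], {}, 0
    for name, value in state.items():
        size = len(pickle.dumps(value))
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = {}, 0
        current[name] = value
        current_size += size
    if current:
        shards.append(current)

    # Write each shard and record which file holds which entry.
    weight_map = {}
    for i, shard in enumerate(shards, start=1):
        file_name = f"shard-{i:05d}-of-{len(shards):05d}.pkl"
        with open(ckpt_dir / file_name, "wb") as f:
            pickle.dump(shard, f)
        for name in shard:
            weight_map[name] = file_name

    with open(ckpt_dir / INDEX_NAME, "w") as f:
        json.dump({"weight_map": weight_map}, f, indent=2)
    return weight_map


def load_sharded(ckpt_dir):
    """Read the index in `ckpt_dir` and merge every referenced shard file."""
    ckpt_dir = Path(ckpt_dir)
    with open(ckpt_dir / INDEX_NAME) as f:
        weight_map = json.load(f)["weight_map"]

    state = {}
    for file_name in sorted(set(weight_map.values())):
        with open(ckpt_dir / file_name, "rb") as f:
            state.update(pickle.load(f))
    return state
```

Under these assumptions, optimizer state could be written the same way into its own set of files with a separate index, so that model and optimizer shards share one checkpoint directory without colliding.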

Status

✅ Done
