Support checkpoint IO for the hybrid parallel plugin. It should handle tp/pp/zero properly. As a first step, complete the sharded case for both model and optimizer.
The features to be implemented include:
- Sharded saving of model: saving model to multiple files under the same checkpoint directory.
- Sharded loading of model: loading model from multiple files under the same checkpoint directory.
- Sharded saving of optimizer: saving optimizer to multiple files under the same checkpoint directory.
- Sharded loading of optimizer: loading optimizer from multiple files under the same checkpoint directory.
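
To illustrate what "multiple files under the same checkpoint directory" can look like, here is a minimal, framework-free sketch. It is not the plugin's implementation: the function names `save_sharded` and `load_sharded`, the `index.json` file name, and the `shard-XXXXX.json` naming are all hypothetical, and plain JSON lists stand in for tensors. It also ignores tp/pp/zero entirely; in a real hybrid parallel run each rank would presumably write only the shards it owns.

```python
import json
import os
import tempfile

INDEX_NAME = "index.json"  # hypothetical index file name, chosen for this sketch


def save_sharded(state, ckpt_dir, max_items_per_shard=2):
    """Split `state` (name -> list of floats) across several shard files.

    An index file records which shard holds each entry, so a loader
    can find every entry without guessing file names.
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    names = sorted(state)
    weight_map = {}
    for shard_id, start in enumerate(range(0, len(names), max_items_per_shard)):
        shard_names = names[start:start + max_items_per_shard]
        shard_file = f"shard-{shard_id:05d}.json"  # hypothetical naming scheme
        with open(os.path.join(ckpt_dir, shard_file), "w") as f:
            json.dump({n: state[n] for n in shard_names}, f)
        for n in shard_names:
            weight_map[n] = shard_file
    with open(os.path.join(ckpt_dir, INDEX_NAME), "w") as f:
        json.dump({"weight_map": weight_map}, f)
    return weight_map


def load_sharded(ckpt_dir):
    """Rebuild the full state by reading every shard listed in the index."""
    with open(os.path.join(ckpt_dir, INDEX_NAME)) as f:
        weight_map = json.load(f)["weight_map"]
    state = {}
    for shard_file in sorted(set(weight_map.values())):
        with open(os.path.join(ckpt_dir, shard_file)) as f:
            state.update(json.load(f))
    return state


if __name__ == "__main__":
    demo_state = {"a.weight": [1.0, 2.0], "a.bias": [0.5], "b.weight": [3.0]}
    demo_dir = tempfile.mkdtemp()
    save_sharded(demo_state, demo_dir)
    print(sorted(os.listdir(demo_dir)))
    print(load_sharded(demo_dir) == demo_state)
```

The same save/load pair could serve both the model and the optimizer cases, since each reduces to writing a name-keyed mapping to several files plus an index.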