[checkpointio] General Checkpointing of Sharded Optimizers #3984

Merged
ver217 merged 1 commit into hpcaitech:develop from Fridge003:feature/optimizer-checkpoint on Jun 15, 2023

Fridge003 (Contributor) commented on Jun 14, 2023

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

fixed #3951
fixed #3961

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

As discussed in #3961, I implemented saving and loading of sharded optimizers for the GeneralCheckpointIO class.
Broadly speaking, the implementation logic closely follows that of saving/loading sharded models, with a number of adjustments to adapt the sharded-model design to sharded optimizers. To avoid OOM errors when loading sharded optimizers, I borrowed the implementation of Optimizer.load_state_dict() from the PyTorch source code and adapted it to our case. Support for specific plugins will be developed in the future.
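
For readers unfamiliar with the approach, here is a minimal, library-free sketch of the general idea, not the code in this PR. It splits optimizer states into size-bounded shard files plus an index file, then restores them one shard at a time so that only a single shard is held in memory at once. The function names, the JSON file format, the file naming scheme, and the use of plain lists in place of tensors are all assumptions made for illustration. The id mapping built with `zip` mirrors how PyTorch's `Optimizer.load_state_dict()` pairs saved parameter ids with the live parameters.

```python
import json
import os
import tempfile
from itertools import chain


def shard_states(states, max_shard_size):
    """Yield dicts of param_id -> state whose total size fits the budget."""
    current, current_size = {}, 0
    for param_id, state in states.items():
        # Element count stands in for the byte size of real tensors.
        size = sum(len(v) for v in state.values())
        if current and current_size + size > max_shard_size:
            yield current
            current, current_size = {}, 0
        current[param_id] = state
        current_size += size
    if current:
        yield current


def save_sharded(state_dict, ckpt_dir, max_shard_size, prefix="optimizer"):
    """Write one file per shard and an index mapping param ids to shard files."""
    weight_map = {}
    shards = shard_states(state_dict["state"], max_shard_size)
    for i, shard in enumerate(shards):
        name = f"{prefix}-{i:05d}.json"  # hypothetical naming scheme
        with open(os.path.join(ckpt_dir, name), "w") as f:
            json.dump(shard, f)
        weight_map.update({str(pid): name for pid in shard})
    index = {"param_groups": state_dict["param_groups"], "weight_map": weight_map}
    with open(os.path.join(ckpt_dir, f"{prefix}.index.json"), "w") as f:
        json.dump(index, f)


def load_sharded(current_groups, ckpt_dir, prefix="optimizer"):
    """Restore states shard by shard, keyed by the live parameters."""
    with open(os.path.join(ckpt_dir, f"{prefix}.index.json")) as f:
        index = json.load(f)
    # Pair saved integer ids with live parameters, group by group.
    saved_ids = chain.from_iterable(g["params"] for g in index["param_groups"])
    live_params = chain.from_iterable(g["params"] for g in current_groups)
    id_map = dict(zip(saved_ids, live_params))
    state = {}
    for name in sorted(set(index["weight_map"].values())):
        with open(os.path.join(ckpt_dir, name)) as f:
            shard = json.load(f)  # only this shard is in memory here
        for pid, value in shard.items():
            state[id_map[int(pid)]] = value  # JSON keys come back as strings
    return state


# Toy optimizer state: three parameters in two groups.
state_dict = {
    "state": {
        0: {"exp_avg": [0.1, 0.2, 0.3], "exp_avg_sq": [0.01, 0.02, 0.03]},
        1: {"exp_avg": [0.4, 0.5], "exp_avg_sq": [0.04, 0.05]},
        2: {"exp_avg": [0.6], "exp_avg_sq": [0.06]},
    },
    "param_groups": [{"lr": 0.001, "params": [0, 1]}, {"lr": 0.01, "params": [2]}],
}
# Strings stand in for the live parameter objects of a freshly built optimizer.
current_groups = [{"lr": 0.001, "params": ["w0", "w1"]}, {"lr": 0.01, "params": ["w2"]}]

with tempfile.TemporaryDirectory() as ckpt_dir:
    save_sharded(state_dict, ckpt_dir, max_shard_size=6)
    shard_files = sorted(n for n in os.listdir(ckpt_dir) if not n.endswith(".index.json"))
    restored = load_sharded(current_groups, ckpt_dir)
```

In this toy run the size budget of 6 elements forces the three parameter states into two shard files, and `restored` ends up keyed by the stand-in live parameters rather than by the saved integer ids.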

A minor fix: the arguments 'variant' and 'prefix' serve the same purpose in the code, so I renamed every 'variant' to 'prefix' to make the code more readable.

Relevant tests are added under tests/test_checkpoint_io/test_general_checkpoint_io.py.

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

Fridge003 requested review from flybird11111 and ver217 on Jun 14, 2023 07:59
flybird11111 added the API (related to API changes) and testing (related to our testing) labels on Jun 15, 2023
ver217 changed the title from "[checkpointio]: General Checkpointing of Sharded Optimizers" to "[checkpointio] General Checkpointing of Sharded Optimizers" on Jun 15, 2023
ver217 merged commit c9cff7e into hpcaitech:develop on Jun 15, 2023
ver217 pushed a commit to ver217/ColossalAI that referenced this pull request on Jul 13, 2023
