[plugin] a workaround for zero plugins' optimizer checkpoint #3776

@ver217

Overview

Because the low-level ZeRO optimizer and the Gemini optimizer cannot yet produce a standard full state dict, we have to save the state dict on every process.

The resulting optimizer state dict also differs from the original one. That is, an optimizer checkpoint saved by the low-level ZeRO optimizer or the Gemini optimizer cannot be loaded by a naive optimizer, and vice versa.

As a workaround, each process saves its own optimizer state dict and, on load, reads back only its corresponding part.

This should be fixed properly once those optimizers support getting the full state dict.
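The per-process workaround described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the helper names (`shard_path`, `save_sharded_state`, `load_sharded_state`) and the `.rank{N}` naming scheme are hypothetical, the rank is passed in explicitly rather than queried from the distributed runtime, and `pickle` stands in for whatever serializer the real implementation uses so that the sketch has no dependencies.

```python
import os
import pickle


def shard_path(checkpoint: str, rank: int) -> str:
    """Hypothetical naming scheme: insert the rank before the file extension."""
    root, ext = os.path.splitext(checkpoint)
    return f"{root}.rank{rank}{ext}"


def save_sharded_state(state: dict, checkpoint: str, rank: int) -> str:
    """Each process writes only the optimizer state it holds, to its own file."""
    path = shard_path(checkpoint, rank)
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path


def load_sharded_state(checkpoint: str, rank: int) -> dict:
    """Each process reads back only the file written by the same rank."""
    with open(shard_path(checkpoint, rank), "rb") as f:
        return pickle.load(f)
```

Because every rank reads and writes a different file, no process ever needs the full state dict. The trade-off, as noted above, is that the resulting files are not interchangeable with a checkpoint from a naive optimizer.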

Goal

  1. Update the low-level ZeRO and Gemini plugins so that each process saves its optimizer checkpoint under a different name.
  2. Add a warning message.
  3. Update the unit tests.
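For the warning in item 2, one possible shape is a small helper built on Python's standard `warnings` module. The function name `warn_sharded_optimizer_checkpoint` and the message wording are assumptions made for illustration; the issue does not specify either.

```python
import warnings


def warn_sharded_optimizer_checkpoint(plugin_name: str) -> None:
    """Hypothetical helper: tell the user the checkpoint format is non-standard."""
    warnings.warn(
        f"{plugin_name} saves the optimizer state dict on each process. "
        "The checkpoint cannot be loaded by a naive optimizer, and vice versa."
    )
```

A plugin could call this once when saving or loading an optimizer checkpoint, for example `warn_sharded_optimizer_checkpoint("GeminiPlugin")`.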

Labels

API (related to API changes), bug (Something isn't working)

Status

✅ Done
