Overview
The low-level ZeRO optimizer and the Gemini optimizer cannot currently produce a standard full state dict, so the optimizer state dict has to be saved on every process.
The saved optimizer state dict also differs from the original format. That is, an optimizer checkpoint saved by the low-level ZeRO optimizer or the Gemini optimizer cannot be loaded by a naive optimizer, and vice versa.
As a workaround, each process saves its own optimizer state dict and loads the corresponding part when the checkpoint is restored.
This should be fixed once those optimizers support getting the full state dict.
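The per-process workaround can be sketched roughly as follows. This is a minimal illustration, not the plugin code in this PR: the helper names, the `.rank{N}` file-name scheme, and the warning text are all hypothetical, and `pickle` stands in for `torch.save`/`torch.load` so the sketch runs without a distributed setup.

```python
import os
import pickle
import tempfile
import warnings


def rank_checkpoint_path(checkpoint: str, rank: int) -> str:
    """Derive a per-process file name, e.g. 'optim.pt' -> 'optim.rank1.pt'.

    The naming scheme is an assumption made for this sketch.
    """
    root, ext = os.path.splitext(checkpoint)
    return f"{root}.rank{rank}{ext}"


def save_sharded_optimizer(state_dict: dict, checkpoint: str, rank: int) -> str:
    """Write only this process's part of the optimizer state."""
    # Hypothetical warning text: the checkpoint is not a standard full state dict.
    warnings.warn(
        "Optimizer checkpoint is saved per process and cannot be loaded "
        "by a naive optimizer."
    )
    path = rank_checkpoint_path(checkpoint, rank)
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)  # stand-in for torch.save
    return path


def load_sharded_optimizer(checkpoint: str, rank: int) -> dict:
    """Read back the part that this process saved earlier."""
    path = rank_checkpoint_path(checkpoint, rank)
    with open(path, "rb") as f:
        return pickle.load(f)  # stand-in for torch.load
```

In a real run, `rank` would come from the distributed runtime (for example `torch.distributed.get_rank()`), and a checkpoint would have to be restored with the same number of processes that saved it, because each file holds only one process's part.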
Goal
- Update the low-level ZeRO and Gemini plugins so that each process saves its optimizer checkpoint under a different name.
- Add a warning message.
- Update the unit tests.