
[BUG]: ZeRO checkpoint would change the offload behavior #4528

@Gy-Lu

🐛 Describe the bug

The current checkpoint code does not handle offload.
For example, with offload enabled, the optimizer state lives on the CPU, which leads to an error during communication (comm).

In addition, when a checkpoint is loaded, the optimizer state should stay on the device it originally occupied rather than take the device of the loaded state. Currently the opposite happens.
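
A minimal sketch of the loading behavior described above, not the library's actual code. The helper `load_state_keep_device` and the stand-in class `FakeTensor` are hypothetical names introduced only for illustration. The helper assumes each tensor-like entry exposes a `.device` attribute and a `.to(device)` method, as `torch.Tensor` does, so the example runs without a GPU or PyTorch installed.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: only .device and .to() are modeled."""
    data: list
    device: str

    def to(self, device):
        # Return a copy of the values placed on the requested device.
        return FakeTensor(list(self.data), device)


def load_state_keep_device(current_state, loaded_state):
    """Take values from loaded_state, but keep each entry on the device
    that the matching entry in current_state already occupies."""
    merged = {}
    for name, loaded in loaded_state.items():
        # Device of the existing entry, if there is one and it has a device.
        target = getattr(current_state.get(name), "device", None)
        if target is not None and hasattr(loaded, "to"):
            merged[name] = loaded.to(target)
        else:
            # Non-tensor entries (e.g. a step counter) are taken as-is.
            merged[name] = loaded
    return merged


# With offload, the live optimizer state sits on the CPU.
current = {"exp_avg": FakeTensor([0.0, 0.0], "cpu"), "step": 0}
# The checkpoint happened to be saved from a GPU-resident state.
loaded = {"exp_avg": FakeTensor([0.1, 0.2], "cuda:0"), "step": 10}

merged = load_state_keep_device(current, loaded)
print(merged["exp_avg"].device)
```

Here the loaded values end up on the device of the existing state, so an offloaded optimizer stays offloaded after the checkpoint is restored.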

Environment

No response

Labels: bug
