
[BUG]: Training process blocks on gc.collect #4393

@HAOCHENYE

Description


🐛 Describe the bug

My training process hangs when saving the optimizer state with Booster.save_optimizer; it blocks at a gc.collect() call in the checkpoint-saving code.
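
For context, the save path that triggers the hang looks roughly like this (a minimal sketch, not my exact setup; the plugin choice, model, and checkpoint path are placeholders):

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# Distributed init; run this script under torchrun.
colossalai.launch_from_torch(config={})

# Placeholder model and optimizer standing in for my real training setup.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())

booster = Booster(plugin=GeminiPlugin())
model, optimizer, *_ = booster.boost(model, optimizer)

# ... training steps ...

# The process hangs inside this call, at a gc.collect() in the
# checkpoint-saving code.
booster.save_optimizer(optimizer, "optimizer.pt")
```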

Although I have no idea why the program blocks at gc.collect, I believe gc.collect is used there to free tensor storage as quickly as possible. If that is the intent, why not use the free_storage utility defined here:

```python
def free_storage(tensor: torch.Tensor) -> None:
    ...
```
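
For reference, here is the rest of that utility as I understand it (reconstructed from ColossalAI's source; the exact parameter name and comments may differ). It frees a tensor's underlying storage by resizing it to zero:

```python
import torch

def free_storage(tensor: torch.Tensor) -> None:
    """Free the underlying storage of a tensor immediately."""
    if tensor.storage().size() > 0:
        # Resizing the storage only makes sense if this tensor is the
        # sole occupant; otherwise other views would be corrupted.
        assert tensor.storage_offset() == 0
        tensor.storage().resize_(0)
```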

Replacing gc.collect with free_storage solves my problem perfectly; a sketch of the change is below.
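
Concretely, the change I have in mind looks roughly like this (a sketch, not the actual ColossalAI call site; `state_tensor` stands in for a tensor the checkpoint code has just finished serializing, and `free_storage` is the utility shown above):

```python
import torch

# `state_tensor` stands in for a tensor the checkpoint code has just
# finished writing and no longer needs.
state_tensor = torch.empty(1024)

# Before: rely on a full garbage-collection cycle to reclaim memory.
#   gc.collect()   # <- the call my process blocks on

# After: release the underlying storage eagerly; no GC cycle needed.
free_storage(state_tensor)
assert state_tensor.storage().size() == 0
```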

If replacing gc.collect with free_storage is reasonable, I'd be happy to create a PR to fix it.

Environment

No response
