🐛 Describe the bug
My training process is blocked when saving an optimizer state via `Booster.save_optimizer`. The program blocks here: `ColossalAI/colossalai/zero/gemini/gemini_optimizer.py`, line 474 (commit 089c365).
Although I have no idea why the program blocks at `gc.collect`, I believe `gc.collect` is called there to free the storage as quickly as possible. So why don't we use the `free_storage` helper defined in `ColossalAI/colossalai/zero/gemini/chunk/chunk.py`, line 43 (commit 089c365):
```python
def free_storage(tensor: torch.Tensor) -> None:
```
Replacing `gc.collect` with `free_storage` solves my problem perfectly.
If replacing `gc.collect` with `free_storage` is a reasonable fix, I'd be happy to create a PR for it.
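For context, here is a minimal sketch of what such a storage-freeing helper could look like. Only the signature comes from the linked `chunk.py`; the body below is my assumption of the typical pattern (shrinking the tensor's underlying storage to zero bytes so the memory is released immediately, without waiting for a garbage-collection cycle):

```python
import torch


def free_storage(tensor: torch.Tensor) -> None:
    """Release the tensor's underlying storage immediately.

    The tensor's metadata (shape, dtype) is left untouched; only the
    backing storage is resized to zero bytes, so the memory is returned
    right away instead of waiting for gc.collect().
    """
    storage = tensor.untyped_storage()
    if storage.size() > 0:
        storage.resize_(0)


# Example: the storage is freed synchronously.
t = torch.empty(1024)
free_storage(t)
print(t.untyped_storage().size())  # 0
```

Unlike `gc.collect`, which scans all tracked objects (and can stall or deadlock when called from certain contexts), this frees exactly one tensor's storage and returns immediately.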
Environment
No response