Hi, thank you for providing the open-source code.
While running stage 2 training with
mpiexec -n 1 python scripts/train.py --latent_dim 64 --encoder_type resnet18 --log_dir log/stage2 --resume_checkpoint log/stage1/stage1_model050000.pt --data_dir peronsal_deca.lmdb --lr 1e-5 --p2_weight True --image_size 256 --batch_size 4 --max_steps 5000 --num_workers 8 --save_interval 5000 --stage 2
the code gave me an error:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
The message says I have to replace all in-place operations with out-of-place ones, or use torch.no_grad().
But sync_params() (which is where the error actually occurs) already uses torch.no_grad():
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
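For reference, one tweak I was considering (I am not sure whether it is the right approach) is to broadcast the underlying .data tensor, which autograd does not track, instead of the parameter itself:

import torch as th
import torch.distributed as dist

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            # Broadcast into p.data, which is not tracked by autograd,
            # so the in-place write should not trip the leaf-variable check.
            dist.broadcast(p.data, 0)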
Can you give me some advice on how to handle this problem?
Thank you.