
About stage 2 training #13

@Rayiz3

Description


Hi, thank you for providing the open-source code.

While doing stage 2 training with the following command,

```shell
mpiexec -n 1 python scripts/train.py --latent_dim 64 --encoder_type resnet18 --log_dir log/stage2 --resume_checkpoint log/stage1/stage1_model050000.pt --data_dir peronsal_deca.lmdb --lr 1e-5 --p2_weight True --image_size 256 --batch_size 4 --max_steps 5000 --num_workers 8 --save_interval 5000 --stage 2
```

the code gave me an error:

```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

The message said that I have to replace all of the in-place operations with out-of-place ones, or use torch.no_grad().

But sync_params() (which is where the error actually occurs) already uses torch.no_grad():

```python
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
```
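For context, here is a minimal reproduction of this error class (not the repo's code, and no distributed setup needed): an in-place write to a leaf tensor that requires grad raises exactly this RuntimeError, while writing through `p.data` bypasses autograd tracking. A commonly suggested workaround along these lines is to change `dist.broadcast(p, 0)` to `dist.broadcast(p.data, 0)` — this is an assumption about the fix for this repo, not a verified patch:

```python
import torch

# A leaf tensor with requires_grad=True, analogous to a model parameter.
p = torch.zeros(3, requires_grad=True)

# An in-place write on such a leaf raises the RuntimeError quoted above.
try:
    p.copy_(torch.ones(3))
    raised = False
except RuntimeError:
    raised = True  # "a leaf Variable that requires grad is being used in an in-place operation"

# Writing through p.data bypasses autograd tracking, so the same copy succeeds.
# Analogously, dist.broadcast(p.data, 0) inside sync_params() would copy the
# broadcast values without autograd seeing an in-place op on a leaf.
p.data.copy_(torch.ones(3))
print(raised, p.tolist())
```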

Can you give me some advice on how to manage this problem?

Thank you.
