Hi, thank you for providing the open-source code.
While running stage 2 training with
mpiexec -n 1 python scripts/train.py --latent_dim 64 --encoder_type resnet18 --log_dir log/stage2 --resume_checkpoint log/stage1/stage1_model050000.pt --data_dir peronsal_deca.lmdb --lr 1e-5 --p2_weight True --image_size 256 --batch_size 4 --max_steps 5000 --num_workers 8 --save_interval 5000 --stage 2
the code gave me an error:
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
The message says I have to replace all in-place operations with out-of-place ones, or use torch.no_grad().
But sync_params() (which is where the error actually occurs) already uses torch.no_grad():
def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)
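For reference, one tweak I was considering (I am not sure whether it is the right approach) is to broadcast the underlying .data tensor, which autograd does not track, instead of the parameter itself:

import torch as th
import torch.distributed as dist

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            # Broadcast into p.data, which is not tracked by autograd,
            # so the in-place write should not trip the leaf-variable check.
            dist.broadcast(p.data, 0)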
Can you give me some advice on how to handle this problem?
Thank you.