
[BUG]: gpc.set_seed() doesn't work for multi-node training  #2380

@haofanwang


🐛 Describe the bug


I'm working on the dreambooth example.

gpc.set_seed(args.seed) causes the following error:

```
AssertionError: The seed for ParallelMode.DATA has been added
```

If I simply comment the call out, everything works fine.
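
My guess is that a seed for ParallelMode.DATA is already registered when the distributed context is initialized, so the second registration trips the assertion. For now I skip the duplicate registration instead of deleting the line; a minimal sketch, assuming the usual global-context import (in the real script the seed comes from args.seed):

```python
from colossalai.core import global_context as gpc

seed = 1234  # in the real script this is args.seed

# Workaround sketch: tolerate the duplicate registration instead of
# removing the call. If a seed for ParallelMode.DATA was already added
# (e.g. during launch), keep the existing one rather than failing.
try:
    gpc.set_seed(seed)
except AssertionError:
    # Seed already registered for this parallel mode; nothing to do.
    pass
```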

To Reproduce

I'm launching with the command below so that I can later scale to multi-machine training, but while testing I only run on a single 8×V100 machine.

```bash
python -m torch.distributed.run --nproc_per_node=$GPU_NUM --nnodes=$WORLD_SIZE \
  --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT ./train_dreambooth_colossalai.py
```
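
For reference, inside the script the distributed context is set up through colossalai's torch launcher, which (as I understand it) reads RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment that torch.distributed.run sets. A rough sketch of what I would expect to work instead, assuming launch_from_torch accepts a seed keyword, in which case the separate gpc.set_seed call shouldn't be needed at all:

```python
import colossalai

# Sketch: initialize the distributed context once and pass the seed here
# instead of calling gpc.set_seed() separately afterwards. The empty config
# dict is a placeholder for the script's real configuration.
colossalai.launch_from_torch(config={}, seed=1234)
```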

Expected behavior
The seed should be set correctly.

Environment

No response
