🐛 Describe the bug
When I use the colossalai CLI with 2 nodes, I get the error "rank 8 and rank 0 both on CUDA device d000".
I have checked my scripts and command, and torchrun works well.
The error message is:
Error: failed to run torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=192.168.0.64 --master_port=29500 benchmark.py -c 7b --plugin zero --zero 1 -l 2048 -g -b 10 on 192.168.0.64, is localhost: True, exception: I/O operation on closed file
Error: failed to run torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=192.168.0.64 --master_port=29500 benchmark.py -c 7b --plugin zero --zero 1 -l 2048 -g -b 10 on 192.168.0.189, is localhost: True, exception: I/O operation on closed file
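For reference, here is a minimal sketch of how the ranks in the commands above would map to nodes and local GPUs. It assumes torchrun's usual static layout (global rank = node_rank * nproc_per_node + local_rank) and that each process uses the GPU matching its local rank; `rank_layout` is a hypothetical helper, not part of colossalai or torchrun.

```python
def rank_layout(nnodes: int, nproc_per_node: int):
    """Return (global_rank, node_rank, local_rank) for every process.

    Assumption: global_rank = node_rank * nproc_per_node + local_rank.
    """
    layout = []
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            global_rank = node_rank * nproc_per_node + local_rank
            layout.append((global_rank, node_rank, local_rank))
    return layout


# Values taken from the failing command: --nnodes=2 --nproc_per_node=8
layout = rank_layout(nnodes=2, nproc_per_node=8)
for global_rank, node_rank, local_rank in layout:
    if local_rank == 0:
        print(f"rank {global_rank}: node {node_rank}, local GPU {local_rank}")
```

Under these assumptions, rank 0 and rank 8 both have local rank 0, but on different nodes, so they should not end up on the same physical device when each node's processes run on their own host.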
Environment
No response