🐛 Describe the bug
I'm training the DreamBooth example over RDMA, but the current backend (gloo) doesn't support RDMA, which leads to the following error:
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:29591: Connection refused
I checked the corresponding function, and I suspect this line contains a typo:
old:
cpu_group = dist.new_group(ranks, backend='gloo') if dist.get_backend() != 'gloo' else None
new:
cpu_group = dist.new_group(ranks, backend='gloo') if dist.get_backend() == 'gloo' else None
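To make the two conditions easier to compare, here is a small sketch that evaluates only the branch logic of each version of the line, without creating any process group. The helper name `creates_gloo_group` and the `"old"`/`"new"` labels are hypothetical and exist only for illustration. The `default_backend` argument stands in for whatever `dist.get_backend()` returns at that point.

```python
# Sketch: evaluate only the branch condition of the old and new versions
# of the cpu_group line. torch.distributed is NOT imported or initialized.


def creates_gloo_group(default_backend: str, version: str) -> bool:
    """Return True if the given version of the conditional would call
    dist.new_group(ranks, backend='gloo'), False if it would yield None.

    `default_backend` stands in for the value of dist.get_backend().
    `version` is a hypothetical label: "old" or "new".
    """
    if version == "old":
        # old: dist.get_backend() != 'gloo'
        return default_backend != "gloo"
    if version == "new":
        # new: dist.get_backend() == 'gloo'
        return default_backend == "gloo"
    raise ValueError(f"unknown version: {version!r}")


if __name__ == "__main__":
    # Example backend names that dist.get_backend() can return.
    for backend in ("nccl", "gloo", "mpi"):
        old = creates_gloo_group(backend, "old")
        new = creates_gloo_group(backend, "new")
        print(f"default={backend}: old creates gloo group={old}, new creates gloo group={new}")
```

With `nccl` as the default backend, the old line creates a separate gloo group and the new line yields `None`; with `gloo` as the default, the reverse happens.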
Environment
No response