[BUG]: backend for supporting distributed training #2407

@haofanwang

🐛 Describe the bug

I'm training the dreambooth example over RDMA, but the current backend (gloo) doesn't support it, which leads to the following error:

RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:29591: Connection refused

I checked the corresponding function, and I suspect the condition on this line is a typo:

old:
cpu_group = dist.new_group(ranks, backend='gloo') if dist.get_backend() != 'gloo' else None
new:
cpu_group = dist.new_group(ranks, backend='gloo') if dist.get_backend() == 'gloo' else None
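
To make the difference between the two conditions concrete, here is a minimal sketch that can be run without a distributed setup. It is only an illustration: `fake_new_group`, `cpu_group_old`, and `cpu_group_new` are hypothetical stand-ins, and the default backend is passed in as a plain string instead of being read from `dist.get_backend()`.

```python
# Hypothetical stand-in for dist.new_group: it only records what was requested
# and does not create a real process group.
def fake_new_group(ranks, backend):
    return (backend, tuple(ranks))


def cpu_group_old(default_backend, ranks):
    # Condition as in the "old" line: build a gloo group only when the
    # default backend is NOT gloo.
    return fake_new_group(ranks, backend="gloo") if default_backend != "gloo" else None


def cpu_group_new(default_backend, ranks):
    # Condition as in the "new" line: build a gloo group only when the
    # default backend IS gloo.
    return fake_new_group(ranks, backend="gloo") if default_backend == "gloo" else None


if __name__ == "__main__":
    # Compare both variants for two possible default backends.
    for backend in ("nccl", "gloo"):
        print(backend, cpu_group_old(backend, [0, 1]), cpu_group_new(backend, [0, 1]))
```

In this sketch the old condition requests an extra gloo group whenever the default backend is something other than gloo, while the new condition requests one only when the default backend is already gloo.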

Environment

No response
