Distributed run of resnet_split.py across 2 servers hangs waiting indefinitely #14

@alphabewitch

Environment: a container started from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
Code: FastNN/resnet/resnet_split.py
Commands executed:
Server 1: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
Server 2: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
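
For reference, the TF_CONFIG value passed above can be sanity-checked with a small standard-library script before launching training. This is only a minimal sketch and not part of FastNN or TensorFlow: the helper name `describe_tf_config` is hypothetical, and it merely parses the JSON to show which address the current task claims and which peers it expects.

```python
import json
import os


def describe_tf_config(raw=None):
    """Parse a TF_CONFIG JSON string and return (task_type, index, own_address, peers).

    Hypothetical helper for illustration; it is not part of FastNN or TensorFlow.
    """
    if raw is None:
        # Fall back to the environment variable set on the command line.
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type")
    index = task.get("index")
    addresses = cluster.get(task_type, [])
    # The task index must point at an entry of the matching cluster list.
    if index is None or not 0 <= index < len(addresses):
        raise ValueError(
            "task index %r does not match cluster spec %r" % (index, cluster)
        )
    own = addresses[index]
    peers = [addr for i, addr in enumerate(addresses) if i != index]
    return task_type, index, own, peers


if __name__ == "__main__":
    # The value passed on server 1 in the commands above.
    example = (
        '{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},'
        '"task":{"type":"worker","index":0}}'
    )
    print(describe_tf_config(example))
```

Running the same check on each server with its own TF_CONFIG should show the two task indices 0 and 1 mapping to different addresses from the same worker list.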

Output on server 1:
image
Output on server 2:
image

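As the output shows, server 1 printed the "still waiting" message only twice and then stopped printing it, which suggests it had received a reply from server 2, but execution did not proceed any further.
Additional note: distributed BERT training runs fine in the same environment, so the two servers can connect to each other and run distributed training normally.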
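
To double-check connectivity independently of TensorFlow, something like the following could be run on each server while the other side is started. It is a sketch using only Python's standard library; `can_connect` is a hypothetical helper, and a successful TCP connection only shows that the port is reachable, not that the gRPC server behind it is healthy.

```python
import socket


def can_connect(address, timeout=3.0):
    """Return True if a TCP connection to "host:port" succeeds within timeout.

    Hypothetical helper for illustration; it only tests TCP reachability.
    """
    host, port = address.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling `can_connect("172.20.21.189:55376")` on server 1 after training has been started on server 2 would indicate whether the second worker's port from the TF_CONFIG above accepts connections.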

Is this a problem with how I am running it, or does the code need to be modified?
