Infinite wait when running resnet_split.py distributed across 2 servers #14

**Environment:** a container started from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
**Code:** FastNN/resnet/resnet_split.py
**Commands:**
Server 1: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
Server 2: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
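
For reference, the two `TF_CONFIG` values above differ only in `task.index`. The sketch below is not part of FastNN; `make_tf_config` and `describe_tf_config` are hypothetical helper names introduced here. It generates the per-server string from one shared worker list and reports which address each server treats as its own, which can help rule out a mismatched index or address.

```python
import json

# Worker list copied from the commands above; the order defines the task index.
WORKERS = ["172.20.21.181:55375", "172.20.21.189:55376"]


def make_tf_config(workers, index):
    """Build the TF_CONFIG JSON string for the worker at position `index`."""
    if not 0 <= index < len(workers):
        raise ValueError("index %d is outside the worker list" % index)
    return json.dumps({
        "cluster": {"worker": list(workers)},
        "task": {"type": "worker", "index": index},
    })


def describe_tf_config(raw):
    """Summarize a TF_CONFIG string: this task's own address and its peers."""
    cfg = json.loads(raw)
    workers = cfg["cluster"]["worker"]
    index = cfg["task"]["index"]
    return {
        "num_workers": len(workers),
        "own_address": workers[index],
        "peers": [w for i, w in enumerate(workers) if i != index],
    }


if __name__ == "__main__":
    # Print what each server would see for its own TF_CONFIG.
    for i in range(len(WORKERS)):
        print(i, describe_tf_config(make_tf_config(WORKERS, i)))
```

Each server's `own_address` should match the IP of the machine where that command is run, and the worker list should be identical on both servers.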

Output on server 1:
![image](https://github.com/alibaba/FastNN/assets/55943192/58d97a8f-fa61-4239-a70b-fd8d1c4ba58b)
Output on server 2:
![image](https://github.com/alibaba/FastNN/assets/55943192/f8b7b791-98f6-4fb3-ac4f-3f979f64ee7f)

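As the screenshots show, server 1 printed the "still waiting" message only twice and then stopped, which suggests that it had received a reply from server 2. However, execution did not proceed any further.
**Additional note:** In the same environment, BERT runs in distributed mode without problems, so the servers can connect to each other and run distributed training normally.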
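
To double-check connectivity for these specific ports, independently of the BERT run, a plain TCP probe can be run from each server against the other's address while the training script is running. This is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper, and a successful probe only shows that the port accepts TCP connections, not that the TensorFlow workers have completed their handshake.

```python
import socket
import sys


def can_connect(address, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds within `timeout` seconds."""
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Addresses are passed on the command line, e.g. the peer's entry from TF_CONFIG.
    for address in sys.argv[1:]:
        state = "reachable" if can_connect(address) else "NOT reachable"
        print(address, state)
```

For example, on server 1 the script (saved under any name) could be called with `172.20.21.189:55376` as its argument, and on server 2 with `172.20.21.181:55375`. The worker port is only expected to accept connections while the training process on that server is up.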

Is this a problem with how I am running it, or does the code need to be modified?

