This repository was archived by the owner on Jan 6, 2023. It is now read-only.
🐛 Bug
Component (check all that applies):
state api, train_step api, train_loop, rendezvous, checkpoint, rollback, metrics, petctl, examples, docker
To Reproduce
Steps to reproduce the behavior:
    LOGLEVEL=INFO python -m torch.distributed.run --rdzv_backend c10d --rdzv_id 1 --rdzv_endpoint "$VC_SH_0_HOSTS" --nnodes 2 echo hello

The hostname is `sh-db2kkt73p534vd-sh-0-0`, but Volcano gives the address `sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd`. Between hosts, resolv.conf, and the hostname, all the information needed to recognize that these addresses are equivalent is available, but the current logic isn't sufficient:
https://github.com/pytorch/pytorch/blob/1b745efbe8ee0ac3bae594ea88ff27e71a734c88/torch/distributed/elastic/rendezvous/utils.py#L110
We may want to do a full DNS resolution on the address and check whether it matches any of the local IP addresses.
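A rough sketch of that idea, using only the Python standard library. The function names `resolve_addresses`, `local_addresses`, and `matches_local_machine` are hypothetical and not part of torch. It assumes the node's own hostname resolves to its interface addresses, which may not hold in every cluster setup:

```python
import ipaddress
import socket


def resolve_addresses(host):
    """Return the set of IP address strings that `host` resolves to."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved at all.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address.
    return {info[4][0] for info in infos}


def local_addresses():
    """Addresses the local hostname resolves to (assumed to be this node's IPs)."""
    return resolve_addresses(socket.gethostname())


def _is_loopback(addr):
    try:
        return ipaddress.ip_address(addr).is_loopback
    except ValueError:
        # e.g. a scoped IPv6 string that this Python version cannot parse
        return False


def matches_local_machine(host):
    """True if `host` resolves to a loopback address or to a local address."""
    candidates = resolve_addresses(host)
    if any(_is_loopback(addr) for addr in candidates):
        return True
    return bool(candidates & local_addresses())
```

With a check along these lines, `sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd` and `sh-db2kkt73p534vd-sh-0-0` would be treated as the same node whenever both names resolve to the same IP address, regardless of how the names are spelled.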
Expected behavior
The launcher recognizes that the hostname refers to the current node and starts the `c10d` server.
Environment
conda, pip, source, docker): docker
Additional context