I am trying to run multi-node training using the example at https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multinode.py on two AWS g4dn.xlarge instances, each with a single GPU. I invoke the script via torchrun.
On node 0, the script is invoked as
torchrun --nproc-per-node=1 --nnodes=2 --node-rank=0 --rdzv-id=456 --rdzv-backend=c10d --rdzv-endpoint=172.16.130.32:16000 multinode.py 10 5
and on node 1 as
torchrun --nproc-per-node=1 --nnodes=2 --node-rank=1 --rdzv-id=456 --rdzv-backend=c10d --rdzv-endpoint=172.16.130.32:16000 multinode.py 10 5
Here, 172.16.130.32 is the private IPv4 address of node 0.
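Since both nodes rendezvous at 172.16.130.32:16000, one sanity check is whether that TCP port is reachable from node 1 while torchrun is already running on node 0. A minimal sketch of such a check follows; the helper name endpoint_reachable is hypothetical and not part of PyTorch, and the 3-second timeout is an arbitrary assumption.

```python
import socket


def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing is listening or the port is blocked.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as err:
        print(f"cannot reach {host}:{port}: {err}")
        return False


# Example, run on node 1 with the endpoint from the torchrun command:
# endpoint_reachable("172.16.130.32", 16000)
```

A False result here would point toward a security group or firewall rule rather than the training script, though it does not rule out other causes.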
When I do the above, I get the following log output
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[W socket.cpp:601] [c10d] The IPv6 network addresses of (ip-172-16-130-220, 52737) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (ip-172-16-130-220, 52737) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
on node 0. Here, 172.16.130.220 is the private IPv4 address of node 1.
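The warning reports a getaddrinfo failure for the hostname ip-172-16-130-220, so the lookup can be reproduced outside of torchrun. Below is a minimal sketch meant to be run on node 0; the helper name can_resolve is hypothetical and not part of PyTorch, and the hostname is simply the one that appears in the log.

```python
import socket


def can_resolve(hostname: str, port: int = 0) -> bool:
    """Return True if getaddrinfo can resolve hostname, False on a gai error."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror as err:
        # The log above shows gai error -3,
        # "Temporary failure in name resolution".
        print(f"could not resolve {hostname!r}: {err}")
        return False


# Example, using the hostname from the warning:
# can_resolve("ip-172-16-130-220")
```

If this fails for the hostname but succeeds for the IP address 172.16.130.220, that would suggest the problem lies in DNS or /etc/hosts on node 0 rather than in multinode.py.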
Any hints on resolving this would be helpful.