C10d ipv6 network address cannot be retrieved error

arunppsg · October 2, 2023, 3:10pm

I am trying multinode training using this example: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multinode.py . My infrastructure is two AWS g4dn.xlarge instances, each with a single GPU, and I invoke the script via torchrun.

On node 0, the script is invoked as

torchrun --nproc-per-node=1 --nnodes=2 --node-rank=0 --rdzv-id=456 --rdzv-backend=c10d --rdzv-endpoint=172.16.130.32:16000 multinode.py 10 5

and on node 1 as

torchrun --nproc-per-node=1 --nnodes=2 --node-rank=1 --rdzv-id=456 --rdzv-backend=c10d --rdzv-endpoint=172.16.130.32:16000 multinode.py 10 5

Here, 172.16.130.32 is the private IPv4 address of node 0.

When I do the above, I get the following log on node 0:

master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[W socket.cpp:601] [c10d] The IPv6 network addresses of (ip-172-16-130-220, 52737) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
[W socket.cpp:601] [c10d] The IPv6 network addresses of (ip-172-16-130-220, 52737) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).

172.16.130.220 is the private IPv4 address of the second node (node 1).

Any hints on resolving this would be helpful.

arunppsg · October 3, 2023, 8:05am

I hadn't enabled DNS resolution and DNS hostnames in the AWS VPC. After enabling both, it worked.
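For anyone who wants to check whether they are hitting the same name-resolution problem before changing VPC settings, here is a minimal sketch using only Python's standard library. It assumes the hostname has the ip-a-b-c-d form seen in the warning above. The helper names `private_dns_name` and `can_resolve` are made up for illustration and are not part of PyTorch. The "gai error" in the log comes from getaddrinfo, and `socket.getaddrinfo` raises `socket.gaierror` for the same kind of failure.

```python
import socket


def private_dns_name(ipv4: str) -> str:
    """Build the ip-a-b-c-d style hostname seen in the c10d warning.

    Assumption: the peer's hostname follows this pattern, as in the log above.
    """
    return "ip-" + ipv4.replace(".", "-")


def can_resolve(hostname: str, port: int = 0) -> bool:
    """Return True if getaddrinfo can resolve the hostname, False otherwise."""
    try:
        socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror as exc:
        # Same family of error as "gai error" in the c10d warning.
        print(f"cannot resolve {hostname!r}: {exc}")
        return False


# Example usage, run on each node with the other node's private IPv4 address:
#   name = private_dns_name("172.16.130.220")
#   print(name, "resolves" if can_resolve(name) else "does NOT resolve")
```

If the other node's name does not resolve, that is consistent with the DNS settings described above being disabled, though other causes (for example a custom resolver configuration) are possible.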

Kokul_Raj · March 4, 2024, 10:10am

I am using an M2 Mac and get the following error:
torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:697] [c10d] The IPv6 network addresses of (1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa, 51738) cannot be retrieved (gai error: 8 - nodename nor servname provided, or not known).

nrxsvzo · April 8, 2025, 3:31pm

For others who encounter this error, this blog post may be helpful: Distributed Training with Pytorch | Hung-Yueh Chiang