I am trying to use Torch RPC for communication between two nodes in different AWS regions, but I get the following error on the client node:
Connecting to 13.57.23.79
Traceback (most recent call last):
  File "/home/ubuntu/examples/distributed/rpc/rl/main.py", line 250, in <module>
    main()
  File "/home/ubuntu/examples/distributed/rpc/rl/main.py", line 247, in main
    run_worker(1,args.world_size)
  File "/home/ubuntu/examples/distributed/rpc/rl/main.py", line 240, in run_worker
    rpc.init_rpc(OBSERVER_NAME.format(rank), rank=rank, world_size=world_size)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: [/opt/conda/conda-bld/pytorch_1656352657443/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [172.31.15.149]:55687: Connection timed out
Here 13.57.23.79 is the public IP address of the master node and 172.31.15.149 is its private IP address. The problem seems to be that, after the initial connection to the master node through its public address, the client retrieves the master's private IP and tries to continue the communication through that private IP.
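This matches the traceback: the timeout comes from gloo's TCP pair trying to reach 172.31.15.149:55687, not from the initial rendezvous on the public address. A rough reachability check from the client node illustrates the two legs (a sketch only; 29550 is the MASTER_PORT I use in the minimal script at the end, and the gloo pair port changes on every run, 55687 is just the one from this traceback):

import socket

def can_connect(host, port, timeout=5):
    # Return True if a plain TCP connection to (host, port) succeeds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from the client node while the master is blocked in init_process_group.
print(can_connect("13.57.23.79", 29550))    # rendezvous leg, public address: reachable
print(can_connect("172.31.15.149", 55687))  # gloo pair leg, private address: times out across regions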
Here is the output of ifconfig on the master node:
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
        ether 02:42:c9:41:ba:33 txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ens5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
        inet 172.31.15.149 netmask 255.255.240.0 broadcast 172.31.15.255
        inet6 fe80::4bc:6dff:fe66:2c93 prefixlen 64 scopeid 0x20<link>
        ether 06:bc:6d:66:2c:93 txqueuelen 1000 (Ethernet)
        RX packets 990921 bytes 559805640 (559.8 MB)
        RX errors 0 dropped 15 overruns 0 frame 0
        TX packets 740812 bytes 75336825 (75.3 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 233712 bytes 26541723 (26.5 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 233712 bytes 26541723 (26.5 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens5 is the interface used for both public and private communication, and I suspect ProcessGroupGloo simply retrieves the IP address of this interface (172.31.15.149), ignoring the fact that it is not publicly reachable.
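On the master node the hostname presumably resolves to exactly that private address (a quick check; I am only assuming that this, or an equivalent interface lookup, is what gloo ends up advertising to its peers):

import socket

# On the master node: the EC2 hostname resolves to the private address of ens5.
hostname = socket.gethostname()
print(hostname, socket.gethostbyname(hostname))  # something like: ip-172-31-15-149 172.31.15.149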
Is there a way to force ProcessGroupGloo to use only the exact IP I give it as MASTER_ADDR (here 13.57.23.79)?
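To make the question concrete, what I would like is for something like the following to be enough on its own (a sketch of the client side; as far as I understand, passing init_method like this is equivalent to setting MASTER_ADDR/MASTER_PORT):

import torch.distributed as dist

# What I would like: every connection, including gloo's pair connections after
# the rendezvous, should go through the address given here, not through the
# private address that the master's interface reports.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://13.57.23.79:29550",
    rank=1,
    world_size=2,
)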
Since the problem seems to boil down to ProcessGroupGloo, I use the following minimal script for debugging, which gives the exact same error:
import os
import sys

import torch.distributed as dist

# Rank is passed on the command line: 0 on the master node, 1 on the client node.
rank = int(sys.argv[1])

# Public IP of the master node and an open port for the rendezvous.
os.environ['MASTER_ADDR'] = '13.57.23.79'
os.environ['MASTER_PORT'] = '29550'

print(f"[ {os.getpid()} ] Initializing process group")
dist.init_process_group(backend="gloo", rank=rank, world_size=2)
print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, "
      f"rank = {dist.get_rank()}, backend={dist.get_backend()}")