I am trying to use Torch RPC for communication between two nodes in different AWS regions, but I get the following error on the client node:
Connecting to 13.57.23.79
Traceback (most recent call last):
  File "/home/ubuntu/examples/distributed/rpc/rl/main.py", line 250, in <module>
    main()
  File "/home/ubuntu/examples/distributed/rpc/rl/main.py", line 247, in main
    run_worker(1,args.world_size)
  File "/home/ubuntu/examples/distributed/rpc/rl/main.py", line 240, in run_worker
    rpc.init_rpc(OBSERVER_NAME.format(rank), rank=rank, world_size=world_size)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: [/opt/conda/conda-bld/pytorch_1656352657443/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [172.31.15.149]:55687: Connection timed out
Here 13.57.23.79 is the public IP address of the master node and 172.31.15.149 is its private IP address. The problem seems to be that, after the initial connection to the master node through its public address, the client retrieves the master's private IP and tries to continue the communication through that private IP.
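This matches the traceback: the timeout comes from gloo's TCP pair trying to reach 172.31.15.149:55687, not from the initial rendezvous on the public address. A rough reachability check from the client node illustrates the two legs (a sketch only; 29550 is the MASTER_PORT I use in the minimal script at the end, and the gloo pair port changes on every run, 55687 is just the one from this traceback):

import socket

def can_connect(host, port, timeout=5):
    # Return True if a plain TCP connection to (host, port) succeeds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from the client node while the master is blocked in init_process_group.
print(can_connect("13.57.23.79", 29550))    # rendezvous leg, public address: reachable
print(can_connect("172.31.15.149", 55687))  # gloo pair leg, private address: times out across regions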
Here is the output of ifconfig on the master node:
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
        ether 02:42:c9:41:ba:33 txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ens5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
        inet 172.31.15.149 netmask 255.255.240.0 broadcast 172.31.15.255
        inet6 fe80::4bc:6dff:fe66:2c93 prefixlen 64 scopeid 0x20<link>
        ether 06:bc:6d:66:2c:93 txqueuelen 1000 (Ethernet)
        RX packets 990921 bytes 559805640 (559.8 MB)
        RX errors 0 dropped 15 overruns 0 frame 0
        TX packets 740812 bytes 75336825 (75.3 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 233712 bytes 26541723 (26.5 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 233712 bytes 26541723 (26.5 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens5 is the interface used for both public and private communication, and I suspect ProcessGroupGloo simply retrieves the IP address of this interface (172.31.15.149), ignoring the fact that it is not publicly reachable.
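On the master node the hostname presumably resolves to exactly that private address (a quick check; I am only assuming that this, or an equivalent interface lookup, is what gloo ends up advertising to its peers):

import socket

# On the master node: the EC2 hostname resolves to the private address of ens5.
hostname = socket.gethostname()
print(hostname, socket.gethostbyname(hostname))  # something like: ip-172-31-15-149 172.31.15.149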
Is there a way to force ProcessGroupGloo to use only the exact IP I give it as MASTER_ADDR (here 13.57.23.79)?
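To make the question concrete, what I would like is for something like the following to be enough on its own (a sketch of the client side; as far as I understand, passing init_method like this is equivalent to setting MASTER_ADDR/MASTER_PORT):

import torch.distributed as dist

# What I would like: every connection, including gloo's pair connections after
# the rendezvous, should go through the address given here, not through the
# private address that the master's interface reports.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://13.57.23.79:29550",
    rank=1,
    world_size=2,
)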
Since the problem seems to boil down to ProcessGroupGloo, I use the following minimal script for debugging, which gives the exact same error:
import os
import sys

import torch.distributed as dist

# Rank is passed on the command line: 0 on the master node, 1 on the client node.
rank = int(sys.argv[1])

# Public IP of the master node and an open port for the rendezvous.
os.environ['MASTER_ADDR'] = '13.57.23.79'
os.environ['MASTER_PORT'] = '29550'

print(f"[ {os.getpid()} ] Initializing process group")
dist.init_process_group(backend="gloo", rank=rank, world_size=2)
print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, "
      f"rank = {dist.get_rank()}, backend={dist.get_backend()}")