Connection refused with GLOO process group initialization

Dear all, I am attempting to use the TCP init method to initialize a process group across two AWS EC2 nodes (in different regions, but with all traffic allowed):

import os
from datetime import timedelta

import torch.distributed as dist
import torch.multiprocessing as mp

def init_process(rank, size, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ["TP_SOCKET_IFNAME"] = 'ens5'
    os.environ["GLOO_SOCKET_IFNAME"] = 'ens5'
    initial_tcp = "tcp://44.234.152.249:23456"  # public IP and port of the rank 0 (master) node
    timeout = timedelta(minutes=5)  # the timeout value was not shown in my original snippet; any reasonable timedelta
    dist.init_process_group(backend, init_method=initial_tcp, rank=rank, world_size=size, timeout=timeout)
    run(rank, size)  # the actual workload (see sketch below)

if __name__ == "__main__":
    size = 2
    processes = []
    mp.set_start_method("spawn")
    init_process(0, size, backend='gloo')    # node 0
    # init_process(1, size, backend='gloo')  # node 1
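(run() itself is not shown here; a minimal stand-in that exercises the Gloo data path, just an illustrative sketch rather than my original function, could be a single all_reduce:)

import torch
import torch.distributed as dist

def run(rank, size):
    """ Placeholder workload: one all_reduce over Gloo's data sockets. """
    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{size}: all_reduce result = {t.item()}")  # expect 3.0 with two ranks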

I have made sure these two nodes can communicate over their public IPs using a plain socket test, and I also tried binding the public IP on the rank 1 node.
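(For reference, the reachability check I mean is roughly the following sketch; the address and port are the same ones used in the init_method above:)

import socket

MASTER_IP, PORT = "44.234.152.249", 23456  # rendezvous address/port from init_method

# On the master node, listen first:
#   srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
#   srv.bind(("0.0.0.0", PORT)); srv.listen(1)
#   conn, peer = srv.accept(); print("got", conn.recv(16), "from", peer)

# On the worker node, connect and send a test message:
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.settimeout(5)
cli.connect((MASTER_IP, PORT))
cli.sendall(b"ping")
cli.close()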

However, during initialization the rank 0 node reports this error:

RuntimeError: [/opt/conda/conda-bld/pytorch_1634272172048/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] 
connect [13.234.25.113]:42897: Connection refused

and the rank 1 node (with public IP 13.234.25.113) reports:

RuntimeError: Connection reset by peer

I used netcat to confirm that port 42897 on the rank 1 node can be reached successfully from the rank 0 node.

I do not know why this happens. Is it because the rank 1 node opened a different port for initialization? (That is my understanding of the "connection refused" error.) Could anyone help me? Thanks!

@HuYang719 do you want to try using the same port for both nodes and see whether that resolves the issue? Also cc @cbalioglu.

Hi Yanli, I am not sure how to use the same port to initialize the process group. The worker node seems to open a random port (which also changes over time) and sends the SYN packet to the master node; I don't think this is controlled by my code.
Is there any way to use the same port on both the worker node and the master node?
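(As a side note, one way to at least observe which TCP ports the worker process has opened is to inspect it from inside Python; this is only a diagnostic sketch and assumes the optional psutil package is available:)

import psutil

def listening_tcp_ports():
    """ TCP ports the current process is listening on, e.g. the ones Gloo picked. """
    conns = psutil.Process().connections(kind="tcp")
    return sorted({c.laddr.port for c in conns if c.status == psutil.CONN_LISTEN})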

Hi @Yanli_Zhao @cbalioglu, I want to share some progress here. But first, I wonder about the most essential question for torch.distributed (GLOO/NCCL backend):

  1. If the master and worker nodes can be reached via public IP (each may sit in its own LAN, but there is no firewall, the public IPs are exposed, all traffic is allowed, and I have confirmed they can communicate using a plain socket), they should be able to communicate using torch.distributed, right?

This question is important since I notice that Gloo is also socket-based, so the answer should be "yes" in my understanding. But I want to double-check here.

Update: I used tshark to capture packets between the master and worker nodes during initialization and found that they established a connection on one worker port, but this port is different from the one in the error reported by the master node.

e.g. master node shows:

RuntimeError: [/opt/conda/conda-bld/pytorch_1634272172048/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] 
connect [13.234.25.113]:48516: Connection refused

and the tshark capture:

   20 3.422250509 172.31.46.74 → 54.68.21.98  TCP 74 39874 → 23456 [SYN] Seq=0 Win=62727 Len=0 MSS=8961 SACK_PERM=1 TSval=1144394851 TSecr=0 WS=128
   22 3.633177330  54.68.21.98 → 172.31.46.74 TCP 74 23456 → 39874 [SYN, ACK] Seq=0 Ack=1 Win=62643 Len=0 MSS=1460 SACK_PERM=1 TSval=1306173413 TSecr=1144394851 WS=128
   23 3.633196068 172.31.46.74 → 54.68.21.98  TCP 66 39874 → 23456 [ACK] Seq=1 Ack=1 Win=62848 Len=0 TSval=1144395062 TSecr=1306173413
   24 3.633234204 172.31.46.74 → 54.68.21.98  TCP 67 39874 → 23456 [PSH, ACK] Seq=1 Ack=1 Win=62848 Len=1 TSval=1144395062 TSecr=1306173413
   25 3.633240735 172.31.46.74 → 54.68.21.98  TCP 87 39874 → 23456 [PSH, ACK] Seq=2 Ack=1 Win=62848 Len=21 TSval=1144395062 TSecr=1306173413 [TCP segment of a reassembled PDU]
   27 3.844343638  54.68.21.98 → 172.31.46.74 TCP 66 23456 → 39874 [ACK] Seq=1 Ack=2 Win=62720 Len=0 TSval=1306173624 TSecr=1144395062
...
  76 5.014276199 172.31.46.74 → 54.68.21.98  TCP 66 39874 → 23456 [FIN, ACK] Seq=402 Ack=27 Win=62848 Len=0 TSval=1144396443 TSecr=1306174738

54.68.21.98 and 23456 are the master address/port. I do not know why the master node reports that port 48516 fails. Hope to hear your suggestions. Thanks!

@HuYang719 Note that the master address/port you have specified (i.e. 54.68.21.98 and 23456) are used by the TCPStore, which is responsible for establishing a “rendezvous” between workers during process bootstrapping. That socket is not related to Gloo. Once the rendezvous is established, Gloo uses its own sockets internally (based on your example, it looks like it picked port 48516 on rank 1) for its own communication. Your problem still seems to be related to some firewall issue. Are you 100% sure that you allow all traffic between the nodes?
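(To make the two kinds of traffic concrete, here is a rough sketch of the same setup using the public TCPStore API explicitly instead of the tcp:// init_method; the addresses are the ones from this thread, and this is an illustration, not the code from the original post:)

from datetime import timedelta
import torch.distributed as dist

MASTER_ADDR, MASTER_PORT, WORLD_SIZE = "54.68.21.98", 23456, 2

def init_with_explicit_store(rank):
    # Rendezvous/bootstrap traffic only goes to MASTER_PORT on the master node.
    store = dist.TCPStore(MASTER_ADDR, MASTER_PORT, WORLD_SIZE,
                          is_master=(rank == 0), timeout=timedelta(seconds=60))
    # Gloo then opens its own ephemeral listening port on every rank
    # (e.g. 48516 on rank 1 above) for the actual collective traffic,
    # so those ports must also be reachable between the nodes.
    dist.init_process_group("gloo", store=store, rank=rank, world_size=WORLD_SIZE)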

As @cbalioglu pointed out for us, the initial master IP and port are used for the rendezvous only. Random ports are used for post-rendezvous communication, so you might be having a firewall/port problem.
This could be risky in terms of security, but I solved a similar problem by doing:

node 0 : sudo ufw allow from [node1 IP]
node 1 : sudo ufw allow from [node0 IP]

I hope there will eventually be a way to avoid having to do this… Hope this helps though :slight_smile: