I have used netcat to confirm that port 42897 on the rank 1 node can be reached from the rank 0 node.
I do not know why this happens. Is it because the rank 1 node opened a different port for initialization? (That is my understanding of what "connection refused" means.) Could anyone help me? Thanks!
Hi Yanli, I am not sure how to use the same port to initialize the process group. The worker node seems to open a random port (which also changes over time) and send the SYN packet to the master node. I feel this is not controlled by the code.
Is there any method to use the same port for both the worker node and the master node? (See the sketch below for how I currently initialize the group.)
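For reference, this is roughly how I initialize the process group (a sketch; the address/port are the ones from my setup). As far as I can tell, only the rendezvous endpoint is pinned here, and the data ports are chosen by the OS:

import os
import torch.distributed as dist

# As far as I can tell, GLOO_SOCKET_IFNAME only pins the network
# interface (e.g. "eth0"), not the data ports; it must be set before init:
# os.environ["GLOO_SOCKET_IFNAME"] = "eth0"

dist.init_process_group(
    backend="gloo",
    init_method="tcp://54.68.21.98:23456",  # pins the rendezvous endpoint only
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)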
Hi @Yanli_Zhao @cbalioglu, I want to post some progress here. But first, the most essential question about torch.distributed (Gloo/NCCL backend):
If the master and worker nodes can reach each other via public IPs (each may be in its own LAN, but with no firewall; after exposing the public IPs and allowing all traffic, I also confirmed they can communicate over a plain socket), they should be able to communicate using torch.distributed, right?
This question is important since I notice Gloo is also socket-based, so to me the answer should be "yes". But I want to double-check here.
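For concreteness, this is roughly the plain-socket check I mean (a sketch; the host/port values are just the ones from my setup):

import socket

# Run on the master node: accept one connection on the rendezvous port.
def check_listen(port=23456):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("0.0.0.0", port))
    srv.listen(1)
    conn, addr = srv.accept()
    print("connected from", addr)
    conn.close()
    srv.close()

# Run on the worker node: connect to the master's public IP.
def check_connect(host="54.68.21.98", port=23456):
    with socket.create_connection((host, port), timeout=10):
        print("connect ok")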
Update: I used tshark to capture packets between the master and worker nodes during initialization and found that they established a connection on one worker port, but this port is different from the one in the error report from the master node.
54.68.21.98 and 23456 are the master address/port. I do not know why the master node reports that port 48516 fails. Hope to hear your suggestions. Thanks!
@HuYang719 Note that the master address/port you have specified (i.e. 54.68.21.98 and 23456) are used by the TCPStore, which is responsible for establishing a "rendezvous" between workers during process bootstrapping. That socket is not related to Gloo. Once a rendezvous is established, Gloo uses its own socket internally for its own communication (based on your example, it looks like it picked port 48516 on rank 1). Your problem still seems to be related to some firewall issue. Are you 100% sure that you allow all traffic between the nodes?
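To make the two phases concrete, here is a minimal sketch (assuming MASTER_ADDR/MASTER_PORT, RANK, and WORLD_SIZE are set in the environment):

import torch
import torch.distributed as dist

# Phase 1: rendezvous. All ranks connect to the TCPStore listening on
# MASTER_ADDR:MASTER_PORT (here 54.68.21.98:23456).
dist.init_process_group(backend="gloo")

# Phase 2: collectives. Gloo now talks over its own sockets, bound to
# OS-assigned ephemeral ports (e.g. 48516 on rank 1). These are the
# connections a firewall must also allow.
t = torch.ones(1)
dist.all_reduce(t)
print("all_reduce ok:", t.item())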
As @cbalioglu pointed out, the initial master IP and port are used for rendezvous only. Random ports are used for post-rendezvous communication, so you might be having a firewall or port problem.
This could be risky in terms of security, but I solved a similar problem by doing:
node 0 : sudo ufw allow from [node1 IP]
node 1 : sudo ufw allow from [node0 IP]
I hope there will be some update so we don't have to do this… Hope this helps, though.