Connect [127.0.1.1]:[a port]: Connection refused

@lcw Do you have any other guesses that I can check? Thank you!

I have the same problem while running a parameter server across multiple instances. Is there any fix for this RPC connection problem?

Which version of PyTorch are you using? Could you compare the ProcessGroup backend and the TensorPipe backend? Does the error occur with both? Could you also set the GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME env vars on your machines to the name of one of the valid network interfaces? That could help narrow down the issue.
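For example, something along these lines as a minimal sketch (the interface name, IP, port, and worker name are placeholders for your setup):

```python
import os
import torch.distributed.rpc as rpc

# Placeholders: use an interface listed by `ip addr` and the master's real IP.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"   # used by the ProcessGroup backend
os.environ["TP_SOCKET_IFNAME"] = "eth0"     # used by the TensorPipe backend
os.environ["MASTER_ADDR"] = "192.168.1.10"
os.environ["MASTER_PORT"] = "29500"

# Run with matching rank/world_size on each machine.
rpc.init_rpc("worker0", rank=0, world_size=2)
rpc.shutdown()
```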

Ok, so I had the same error and I got it working by doing the following:

1: Use the ProcessGroup backend (not sure if this is necessary; I didn't have time to test).
2: Edit /etc/hosts and change 127.0.1.1 to your actual IP address. It seems you only need to do this on the master machine; a quick way to verify the change is sketched below.
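
As a quick sanity check before and after the edit (a sketch; the IP and hostname below are placeholders), you can print what your hostname actually resolves to:

```python
import socket

# If this still prints 127.0.1.1, remote workers cannot reach the address the
# master advertises. Map the hostname to the real IP in /etc/hosts instead,
# e.g. "192.168.1.10  my-hostname" (placeholder values).
hostname = socket.gethostname()
print(hostname, "->", socket.gethostbyname(hostname))
```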

Suggestion #2 works, but you need to edit /etc/hosts on both the master and the workers.

Hi, I’m facing the same problem here. I think it is a permission issue: if the client machine were allowed to reach the master's IP and port, it would work. My question is, is there really no rule for which port numbers the kernel picks? If I ask the server admin for access, do I have to ask them to open every port on the master machine to my client machine, is that correct?

Also, I used netcat to check. For the random ports created by the kernel, I cannot connect to the master machine from the client machine. But for the MASTER_ADDR and MASTER_PORT I assigned, the admin has already given me access, so I can connect to that port with netcat. This confirms that it is a permission problem.
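
If netcat is not available on a machine, the same reachability check can be done with a small Python sketch (host and port are placeholders):

```python
import socket

# Placeholder values: use your MASTER_ADDR / MASTER_PORT, or one of the
# kernel-assigned ports you want to test.
host, port = "192.168.1.10", 29500

try:
    # Succeeds only if the port is reachable through the firewall.
    with socket.create_connection((host, port), timeout=5):
        print(f"{host}:{port} is reachable")
except OSError as exc:
    print(f"cannot connect to {host}:{port}: {exc}")
```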

I had a similar problem when trying to do distributed training on two remote servers :frowning:

My two servers were located on different racks, so they had to communicate over TCP and Ethernet. Whenever I ran torch.distributed training, the process on each node just froze on NCCL operations (operations starting with torch.distributed.xxx). The problem was that the master address and port you initially specify are not used for the actual communication between the two nodes. Rather, from what I have encountered, after the rendezvous of all the nodes, the ports used for communication change every time the nodes communicate. From what I remember, the master node might or might not change (I need to check on this to be sure). For this reason, the firewall kept preventing the remote nodes from communicating, because the ports were random for every NCCL operation. My solution was to open all ports of a node to the IP addresses of all the other nodes that are trying to work together.
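
For reference, here is a minimal sketch of the kind of script that hangs on the collective call when those ports are blocked (the master address and port are placeholders, and RANK / WORLD_SIZE are assumed to be set per node, e.g. by torchrun):

```python
import os
import torch
import torch.distributed as dist

# Placeholders: use the real master IP and a port that is open on the master.
os.environ.setdefault("MASTER_ADDR", "192.168.1.10")
os.environ.setdefault("MASTER_PORT", "29500")

# RANK and WORLD_SIZE are assumed to be set per node (by torchrun or by hand).
dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

# The rendezvous above can succeed, yet this collective hangs if the
# dynamically chosen ports are blocked by the firewall between the nodes.
x = torch.ones(1, device="cuda")   # assumes one GPU per node
dist.all_reduce(x)
print("all_reduce finished:", x.item())
dist.destroy_process_group()
```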

Say you are trying to use two remote nodes; then my solution was to add a 'sudo ufw' rule on each node allowing access from all the other node IPs working together.
Node 0 : sudo ufw allow from [node1 IP]
Node 1 : sudo ufw allow from [node0 IP]

The solution above fixed my problem for both the NCCL and Gloo backends. For security reasons this might not be recommended, but at least I found out what was causing the code to hang.
Hope this helps :+1:

I have this problem while running PyTorch inside Docker on a main server and workers. Any idea how to make sure that 127.0.0.1 is redirected to the real host?