Torch.distributed.send/recv not working

I was trying to run DDP transformer training across two machines (machine 1 and machine 2) when I found the whole script stuck at dist.send and dist.recv. The same script works fine when all processes run on machine 1, so I was confused.

First, I simplified the code to a single send/recv pair:

import torch
import torch.distributed as dist

# For machine1
dist.init_process_group(backend="nccl", init_method="tcp://machine1:port", rank=0, world_size=2)
a = torch.zeros(1, device="cuda")
dist.recv(a, src=1)

# For machine2
dist.init_process_group(backend="nccl", init_method="tcp://machine1:port", rank=1, world_size=2)
a = torch.ones(1, device="cuda")  # match the dtype and shape of the tensor received on rank 0
dist.send(a, dst=0)

This didn’t change anything. dist.init_process_group works, and both processes can get information from calls like dist.get_rank(), but they still hang at send/recv.
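For anyone hitting the same hang: turning on NCCL’s own logging before creating the process group shows which interfaces and connections it tries to set up. A minimal sketch for rank 0 (the rendezvous address is the same placeholder as above):

import os
import torch
import torch.distributed as dist

# Must be set before init_process_group / the first NCCL call
os.environ["NCCL_DEBUG"] = "INFO"             # print NCCL's connection setup details
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on init and network messages

dist.init_process_group(backend="nccl", init_method="tcp://machine1:port",
                        rank=0, world_size=2)

a = torch.zeros(1, device="cuda")
dist.recv(a, src=1)  # the NCCL log around this call shows the connections being set up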

Next, I tested the network connection between the machines. Ping, telnet, and nc all worked, so I believed it was not a network-connectivity or firewall issue.
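For reference, the check was essentially the one below, i.e. a plain TCP connection to the rendezvous port (hostname and port are placeholders). In hindsight this only proves that this single, fixed port is reachable.

import socket

# Rough Python equivalent of the telnet/nc test against the init_method port.
# "machine1" and 29500 are placeholders for the actual host and port.
with socket.create_connection(("machine1", 29500), timeout=5):
    print("TCP connection to the rendezvous port succeeded")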

Then I replaced machine 2 with another machine (machine 3) that also passed the network tests, but nothing changed.

I also searched for this issue and tried suggested solutions such as setting NCCL_IB_DISABLE=1 (from Torch distributed not working on two machines [nccl backend]), but they did not work.
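For completeness, this is roughly how such variables get set; they have to be in the environment before the first NCCL call. NCCL_SOCKET_IFNAME is another knob that is often suggested alongside it, and the interface name eth0 below is just a placeholder:

import os
import torch.distributed as dist

# Set before init_process_group; NCCL reads these at initialization time.
os.environ["NCCL_IB_DISABLE"] = "1"        # skip InfiniBand, use plain TCP sockets
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder: an interface reachable from both machines

dist.init_process_group(backend="nccl", init_method="tcp://machine1:port",
                        rank=0, world_size=2)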

I also tried switching to the gloo backend, but then I got an address family mismatch error during dist.init_process_group, which I couldn’t solve either.
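A common suggestion for that gloo error is to pin the network interface explicitly with GLOO_SOCKET_IFNAME; I haven’t verified it in my setup, but the idea looks like this (the interface name is again a placeholder):

import os
import torch
import torch.distributed as dist

# Gloo picks a network interface automatically; forcing the same one on both
# machines is the usual workaround for interface / address-family mismatches.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # placeholder interface name

dist.init_process_group(backend="gloo", init_method="tcp://machine1:port",
                        rank=0, world_size=2)

a = torch.zeros(1)   # gloo send/recv works with CPU tensors
dist.recv(a, src=1)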

All my machines run Ubuntu with Python 3.8.18, PyTorch 1.8.2, CUDA 11.1, and NCCL 2708. I’m not very familiar with PyTorch and CUDA, so I’m not sure whether more information or further checks are needed. If so, please tell me and I’m willing to provide any information.

Problem solved.
It seems that NCCL picks a random usable port for each process, and send/recv tries to connect directly to that random port, which was blocked by the firewall.

However, I am still wondering whether this port can be limited to some range, or to a fixed port? Opening all ports on the firewall does not seem like a good idea.
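As far as I can tell, only the rendezvous port can be pinned, e.g. with the env:// init method below (address and port are placeholders); the extra ports NCCL opens for the actual send/recv connections are still chosen dynamically, which is exactly what the firewall was blocking.

import os
import torch.distributed as dist

# Only the rendezvous/store port is fixed by MASTER_ADDR/MASTER_PORT (placeholders),
# so just this one port needs to be open for init_process_group itself.
os.environ["MASTER_ADDR"] = "machine1"
os.environ["MASTER_PORT"] = "29500"

dist.init_process_group(backend="nccl", init_method="env://", rank=0, world_size=2)

# The point-to-point connections NCCL creates afterwards still listen on
# dynamically chosen ports, so the firewall has to allow those as well.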


What changes did you make that resolved the issue?