I was trying to run DDP transformer training across two machines (call them machine 1 and machine 2) when I found the whole script stuck at dist.send and dist.recv. The same script works fine when all processes run on machine 1, so I was confused.
First, I simplified the code to a single send/recv:
import torch
import torch.distributed as dist

# On machine 1 (rank 0):
dist.init_process_group(backend="nccl", init_method="tcp://machine1:port", rank=0, world_size=2)
a = torch.zeros(1, device="cuda")
dist.recv(a, src=1)

# On machine 2 (rank 1):
dist.init_process_group(backend="nccl", init_method="tcp://machine1:port", rank=1, world_size=2)
a = torch.tensor([1.0], device="cuda")  # float, to match the float32 zeros tensor on rank 0
dist.send(a, dst=0)
This didn't change anything: dist.init_process_group succeeds, and both processes can query things like dist.get_rank(), but they still hang at send/recv.
Next, I tested the network connection between the machines. Ping, telnet, and nc all worked, so I believe it's not a connectivity or firewall issue.
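For completeness, here is roughly what my connectivity check amounts to in Python (a sketch; port 29500 is an assumed placeholder, since the real port above is redacted):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (like `nc -z`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On machine 2, I verified machine 1's rendezvous port is reachable, e.g.:
# print(can_connect("machine1", 29500))
```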
Then I replaced machine 2 with another machine (machine 3) that also passed the network tests; nothing changed.
I also searched for this issue; suggested fixes such as setting NCCL_IB_DISABLE=1 (from "Torch distributed not working on two machines [nccl backend]") did not work.
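In case it matters, this is how I set the variable, before dist.init_process_group is called (NCCL_DEBUG=INFO is my own addition to get more logging, not something from the linked thread):

```python
import os

# Environment variables must be set before init_process_group creates the
# NCCL communicator, otherwise they have no effect.
os.environ["NCCL_IB_DISABLE"] = "1"   # suggested fix from the linked thread
os.environ["NCCL_DEBUG"] = "INFO"     # extra logging I turned on while debugging
```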
I tried switching to the gloo backend, but then I got an "address family mismatch" error inside dist.init_process_group, which I also couldn't solve.
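My guess (unconfirmed) is that the gloo error means the two machines resolve the rendezvous host to different address families (one IPv4, one IPv6). A quick check I can run on each machine, using localhost here as a stand-in for machine1 and 29500 as an assumed port:

```python
import socket

# List the address families this host resolves the name to; if the two
# machines disagree (e.g. IPv4-only vs IPv6), that could explain the
# "address family mismatch" from gloo.
infos = socket.getaddrinfo("localhost", 29500, proto=socket.IPPROTO_TCP)
print({info[0].name for info in infos})  # e.g. {'AF_INET'} or {'AF_INET', 'AF_INET6'}
```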
All my machines run Ubuntu, with Python 3.8.18, PyTorch 1.8.2, CUDA 11.1, and NCCL 2708 (i.e. 2.7.8). I'm not very familiar with PyTorch and CUDA, so I'm not sure whether more information or further checks are needed; if so, please tell me and I'll gladly provide it.