Hi, I’m new to distributed training across machines with DDP, and I’m writing a simple demo to try it out. The demo runs normally on a single machine with multiple GPUs, but it does not work across multiple machines. I have tried many things and still can’t solve the problem. I really need some help, thanks.
Environment
2 machines, each with 2 RTX 3090 GPUs
python==3.6.13
pytorch==1.8.0+cu111
CUDA==11.1
I have also tested the following environments and hit the same problem:
pytorch==1.7.0+cu101
pytorch==1.8.1+cu111
pytorch==1.9.0+cu111
Code
I uploaded my code to GitHub. There are two versions with different launch methods:
I suspect the problem might be the firewalls on my two machines. However, the message at L77 is printed as expected, which means my process group is initialized successfully (L74). So I don’t know how to find and fix the problem.
Yes, I have been using NCCL so far.
Now I tried gloo and the program works fine. Thank you for your advice.
Note
When I first tried gloo, I got an “address family mismatch” error, the same one described in discuss.64753. I solved it by specifying GLOO_SOCKET_IFNAME.
# bash
export GLOO_SOCKET_IFNAME=eno2np1
Inspired by this, I also tried specifying NCCL_SOCKET_IFNAME, but that didn’t fix the NCCL problem.
# bash
export NCCL_SOCKET_IFNAME=eno2np1
So I still can’t use NCCL at the moment. Gloo’s lack of support for some operators limits what my code can do. Could you please give me some advice on how to solve this problem?
When I use the NCCL backend, machine0 blocks at L78: dist.barrier(device_ids=[gpu]). After a while, machine1 terminates on its own with the following error:
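For anyone who wants to set these variables from inside the script rather than the shell, here is a minimal sketch. It assumes the variables are set before the process group is created, and the helper names (available_interfaces, pin_socket_interface) are made up for illustration. The interface name eno2np1 is the one from my machines; yours will probably differ, so it is worth listing the interfaces on every machine first.

```python
import os
import socket


def available_interfaces():
    """Return the network interface names visible on this machine."""
    return [name for _index, name in socket.if_nameindex()]


def pin_socket_interface(ifname, env=os.environ):
    """Point both Gloo and NCCL at one interface.

    Assumption: this runs before torch.distributed.init_process_group,
    because the backends read these variables when they start up.
    """
    env["GLOO_SOCKET_IFNAME"] = ifname
    env["NCCL_SOCKET_IFNAME"] = ifname
    return env


# Example usage (hypothetical, adjust the name to your own interface):
#   print(available_interfaces())
#   pin_socket_interface("eno2np1")
```

This only makes the shell exports reproducible inside the script. It does not by itself explain why NCCL still hangs.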
Yes, that is what I think is confusing.
In fact, when the environment is changed to “pytorch==1.9.0+cu111”, process-1 is able to pass dist.barrier() and then gets stuck at the next synchronization point. Process-0 is still stuck in the barrier.
Did you come up with any solution to this?
I’m wondering if this could be a TCP issue, i.e. the two remote server nodes just cannot communicate for some reason…
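One quick way to test that guess is a plain TCP reachability check, run from one node against the other. This is only a sketch: can_connect is a hypothetical helper, and the host and port in the usage comment are placeholders. As far as I know, NCCL also opens its own connections on dynamically chosen ports in addition to the rendezvous port, so a successful check on the master port alone would not rule out a firewall problem.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage (placeholder address and port, replace with your own):
#   print(can_connect("192.168.1.10", 29500))
```

Something has to be listening on the target port for the check to succeed, for example the rank 0 process waiting in its rendezvous.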
@KaiiZhang Can you run with the environment variable NCCL_DEBUG=INFO? That would give more information about why NCCL is failing. Could you share the complete logs for all processes after setting NCCL_DEBUG=INFO?
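If exporting the variable in every shell is inconvenient, a rough equivalent is to set it at the top of the training script, before any NCCL communicator is created. This is a sketch under that assumption; enable_nccl_debug is a hypothetical helper name, and NCCL_DEBUG_SUBSYS is an optional NCCL variable for narrowing the output to particular subsystems.

```python
import os


def enable_nccl_debug(env=os.environ, subsys=None):
    """Turn on verbose NCCL logging for this process.

    Assumption: called before torch.distributed.init_process_group, so the
    setting is in place when NCCL initializes.
    """
    env["NCCL_DEBUG"] = "INFO"
    if subsys:
        # Optional filter, e.g. "INIT,NET" to focus on setup and networking.
        env["NCCL_DEBUG_SUBSYS"] = subsys
    return env


# Example usage:
#   enable_nccl_debug(subsys="INIT,NET")
```

Each process writes its own log lines, so the output from every rank on both machines is needed to see where the ranks diverge.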