Cannot run distributed training across machines with DDP

Background

Hi, I’m new to distributed training across machines with DDP. I’m currently writing a simple demo to try it out. The demo runs fine in the single-machine, multi-GPU setting, but I cannot get it to work across multiple machines. I have tried many things but still can’t solve the problem. I would really appreciate some help, thanks.

Environment

  • 2 machines, each with two RTX 3090 GPUs
  • python==3.6.13
  • pytorch==1.8.0+cu111
  • CUDA==11.1

I have also tested other environments but still face the same problem:

  • pytorch==1.7.0+cu101
  • pytorch==1.8.1+cu111
  • pytorch==1.9.0+cu111

Code

I uploaded my code to GitHub. There are two versions with different launch methods.

I use the TCP version (mnist-ddp-tcp.py) as an example:

Runs successfully on a single machine with multiple GPUs

# bash (both processes on the same machine)
python mnist-ddp-tcp.py --init_method tcp://[MACHINE0_IP]:[PORT] --rank 0 --world_size 2 --gpuid 0
python mnist-ddp-tcp.py --init_method tcp://[MACHINE0_IP]:[PORT] --rank 1 --world_size 2 --gpuid 1

Fails to run across multiple machines

# bash machine-0
python mnist-ddp-tcp.py --init_method tcp://[MACHINE0_IP]:[PORT] --rank 0 --world_size 2 --gpuid 0
# bash machine-1
python mnist-ddp-tcp.py --init_method tcp://[MACHINE0_IP]:[PORT] --rank 1 --world_size 2 --gpuid 0

The program blocks at L78: dist.barrier(device_ids=[gpu]). When I comment out L78, it gets stuck at L98 instead.

(screenshot: output on machine0)

Possible causes

I suspect the problem might be the firewalls on my two machines. But I also see that L77 prints normally, i.e., the process group is initialized successfully (L74). So I don’t know how to track down and fix the problem.
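As a sanity check for the firewall theory, one could verify that the rendezvous port is reachable between the two machines. A rough sketch (here <PORT> is a placeholder for whatever is passed as [PORT]; ss and nc may need to be installed):

# bash machine-0, after starting rank 0: is the TCP store listening?
ss -ltn | grep <PORT>
# bash machine-1: can it reach machine-0 on that port?
nc -zv [MACHINE0_IP] <PORT>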

Looks like you are using NCCL as your process group backend? Do you mind giving it a try with Gloo?


Yes, I have been using NCCL so far.
I have now tried Gloo and the program works fine. Thank you for your advice.

Note

I tried Gloo and at first I got an “address family mismatch” error, the same as in discuss.64753. I solved this by specifying GLOO_SOCKET_IFNAME:

# bash
export GLOO_SOCKET_IFNAME=eno2np1
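Note that eno2np1 is simply the name of the network interface connecting my two machines; listing the interfaces and their IPv4 addresses should show which one to use on your own setup:

# bash: list interfaces and their IPv4 addresses
ip -o -4 addr show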

Inspired by this, I also tried specifying NCCL_SOCKET_IFNAME, but that did not fix the NCCL problem.

# bash
export NCCL_SOCKET_IFNAME=eno2np1
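(For completeness: the variable has to be visible to the launched process on each machine, e.g. set inline with that machine’s own interface name; shown here for machine-1 only as a sketch.)

# bash machine-1 (interface name may differ per machine)
NCCL_SOCKET_IFNAME=eno2np1 python mnist-ddp-tcp.py --init_method tcp://[MACHINE0_IP]:[PORT] --rank 1 --world_size 2 --gpuid 0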

So I still can’t use NCCL at the moment, and Gloo’s lack of support for some operations limits the functionality of my code. Could you please give me some advice on how to solve this problem?

Supplement

When I use the NCCL backend, machine0 blocks at L78: dist.barrier(device_ids=[gpu]). After a while, machine1 terminates on its own with the following error:

@KaiiZhang It looks like both ranks passed init_process_group() and then failed at the dist.barrier() step in your code?

Yes, that is what I find confusing.
In fact, when I change the environment to pytorch==1.9.0+cu111, process 1 is able to pass the dist.barrier() and then gets stuck at the next synchronization point, while process 0 remains stuck in the barrier.

Did you come up with any solution to this?
I’m wondering if this could be some TCP issue, i.e., the two remote server nodes just cannot communicate for some reason… 🙁

I think the problem is your firewall. Try this.
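For instance, a quick, diagnostic-only test is to temporarily disable the firewall on both machines, retry the NCCL run, and then re-enable it and open only the required ports. The exact commands depend on the distribution; a sketch:

# bash, on both machines (remember to re-enable the firewall afterwards)
sudo systemctl stop firewalld   # firewalld-based systems (e.g. CentOS/RHEL)
sudo ufw disable                # ufw-based systems (e.g. Ubuntu)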

@KaiiZhang Can you run with the environment variable NCCL_DEBUG=INFO? That would give more information about why NCCL is failing. Could you share the entire logs for all processes after setting NCCL_DEBUG=INFO?
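For example, a sketch of how the variable can be set before launching each rank (NCCL_DEBUG_SUBSYS is optional and just narrows the output to the init and network subsystems):

# bash, on each machine before launching its rank (use --rank 1 on machine-1)
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
python mnist-ddp-tcp.py --init_method tcp://[MACHINE0_IP]:[PORT] --rank 0 --world_size 2 --gpuid 0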