Attempting to use DDP across 2 machines, using code from the example at Multinode Training — PyTorch Tutorials 2.2.0+cu121 documentation (slightly adapted so the command-line args have default values). The same code is on both machines, and both are running Python 3.10.12 in virtual environments.
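For context, the ddp_setup in my multinode.py is essentially the tutorial's version (the traceback further down shows it with backend="gloo" during the Gloo test); roughly:

import os
import torch
from torch.distributed import init_process_group

def ddp_setup():
    # torchrun's rendezvous provides MASTER_ADDR, MASTER_PORT, RANK,
    # WORLD_SIZE and LOCAL_RANK, so only the backend is passed explicitly.
    init_process_group(backend="nccl")  # swapped to "gloo" when testing that backend
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))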
On the rank 0 machine (1 GeForce RTX 3080), I run the following command to start things off:
torchrun --nproc-per-node 1 --nnodes 2 --node-rank 0 --rdzv-id 777 --rdzv-backend c10d --rdzv-endpoint localhost:1840 multinode.py
Regardless of what backend I choose (NCCL/GLOO), it appears to start normally.
On the rank 1 machine (4 GeForce GTX TITAN 1080s), I run the following command to attempt to connect:
torchrun --nproc-per-node 4 --nnodes 2 --node-rank 1 --rdzv-id 777 --rdzv-backend c10d --rdzv-endpoint <ip of rank 0>:1840 multinode.py
And this is where the errors seem to start. Note that I use the same backend for both. If using NCCL, then I see this in the debug info for each GPU:
NCCL INFO cudaDriverVersion 12000
[0] NCCL INFO Bootstrap : Using enp6s0:192.168.1.100<0>
[0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
[0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
[0] NCCL INFO NET/IB : No device found.
[0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.100<0>
[0] NCCL INFO Using network Socket
[1] NCCL INFO cudaDriverVersion 12000
And after some time, the following error message:
[2024-01-10 14:00:05,232] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'xxxx_589907_0' has failed to send a keep-alive heartbeat to the rendezvous '777' due to an error of type RendezvousConnectionError.
After starting rank 0, I can confirm through telnet that I'm able to establish a connection with it from rank 1 over port 1840 (a rough Python equivalent of that check is included after the ACS output below). I also set up an iperf server on rank 0 and confirmed that rank 1 could reach it and that the speeds were good; the machines are physically close together and on a wired network. I tried the whole process again with NCCL_P2P_DISABLE=1 set on both machines, and then on only one or the other of them, but still got the same error. I also checked whether ACS was enabled, following Troubleshooting — NCCL 2.19.3 documentation, and got these results (which I think indicate that it is already disabled):
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
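For reference, the telnet check mentioned above is roughly equivalent to this small Python snippet run from the rank 1 machine (my own sketch; <ip of rank 0> is the same address I pass to --rdzv-endpoint):

import socket

# Open a plain TCP connection from rank 1 to the rendezvous endpoint on
# rank 0, mirroring the telnet test; this raises an exception on failure.
with socket.create_connection(("<ip of rank 0>", 1840), timeout=5) as s:
    print("connected to", s.getpeername())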
Then I tried the GLOO backend on both, which seems to briefly establish a connection before throwing the following error:
Traceback (most recent call last):
  File "/home/noahg/universal-training-branch/exploration-enabler/multinode.py", line 112, in <module>
    main(args.save_every, args.total_epochs, args.batch_size)
  File "/home/noahg/universal-training-branch/exploration-enabler/multinode.py", line 96, in main
    ddp_setup()
  File "/home/noahg/universal-training-branch/exploration-enabler/multinode.py", line 14, in ddp_setup
    init_process_group(backend="gloo")
  File "/home/noahg/universal-training-branch/exploration-enabler/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/noahg/universal-training-branch/exploration-enabler/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/noahg/universal-training-branch/exploration-enabler/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1264, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:276] ss1.ss_family == ss2.ss_family. 2 vs 10
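For what it's worth, 2 and 10 look like the Linux values of AF_INET and AF_INET6, so this may be an IPv4 vs IPv6 mismatch between the two sides, though I'm not sure. A quick check I can run on both machines (my own sketch; <ip of rank 0> again stands in for the address I pass to --rdzv-endpoint):

import socket

# Print the address families that the rendezvous endpoint and this
# machine's own hostname resolve to. On Linux, AF_INET == 2 (IPv4) and
# AF_INET6 == 10 (IPv6), matching the "2 vs 10" in the Gloo error above.
for host in ("<ip of rank 0>", socket.gethostname()):
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(host, 1840):
        print(host, family.name, sockaddr)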
I have not been able to find any additional info on this exact RuntimeError for GLOO. Any help would be greatly appreciated. I can provide additional info if needed.
Thanks