Attempting to use DDP across 2 machines, using code from the example at Multinode Training — PyTorch Tutorials 2.2.0+cu121 documentation (slightly adapted so the command-line args have default values). The same code is on both machines, and both are running Python 3.10.12 in virtual environments.
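For context, the ddp_setup in my multinode.py is essentially the tutorial's version (the traceback further down shows it with backend="gloo" during the Gloo test); roughly:

import os
import torch
from torch.distributed import init_process_group

def ddp_setup():
    # torchrun's rendezvous provides MASTER_ADDR, MASTER_PORT, RANK,
    # WORLD_SIZE and LOCAL_RANK, so only the backend is passed explicitly.
    init_process_group(backend="nccl")  # swapped to "gloo" when testing that backend
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))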
On the rank 0 machine (1 GeForce RTX 3080), I run the following command to start things off:
torchrun --nproc-per-node 1 --nnodes 2 --node-rank 0 --rdzv-id 777 --rdzv-backend c10d --rdzv-endpoint localhost:1840 multinode.py
Regardless of what backend I choose (NCCL/GLOO), it appears to start normally.
On the rank 1 machine (4 GeForce GTX TITAN 1080s), I run the following command to attempt to connect:
torchrun --nproc-per-node 4 --nnodes 2 --node-rank 1 --rdzv-id 777 --rdzv-backend c10d --rdzv-endpoint <ip of rank 0>:1840 multinode.py
And this is where the errors seem to start. Note that I use the same backend for both. If using NCCL, then I see this in the debug info for each GPU:
NCCL INFO cudaDriverVersion 12000
[0] NCCL INFO Bootstrap : Using enp6s0:192.168.1.100<0>
[0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
[0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
[0] NCCL INFO NET/IB : No device found.
[0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.100<0>
[0] NCCL INFO Using network Socket
[1] NCCL INFO cudaDriverVersion 12000
And after some time, the following error message:
[2024-01-10 14:00:05,232] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'xxxx_589907_0' has failed to send a keep-alive heartbeat to the rendezvous '777' due to an error of type RendezvousConnectionError.
After starting rank 0, I can confirm through telnet that I'm able to establish a connection with it from rank 1 over port 1840 (a rough Python equivalent of that check is included after the ACS output below). I also set up an iperf server on rank 0 and confirmed that rank 1 could reach it and that the speeds were good; the machines are physically close together and on a wired network. I tried the whole process again with NCCL_P2P_DISABLE=1 set on both machines, and then on only one or the other of them, but still got the same error. I also checked whether ACS was enabled, following Troubleshooting — NCCL 2.19.3 documentation, and got these results (which I think indicate that it is already disabled):
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
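For reference, the telnet check mentioned above is roughly equivalent to this small Python snippet run from the rank 1 machine (my own sketch; <ip of rank 0> is the same address I pass to --rdzv-endpoint):

import socket

# Open a plain TCP connection from rank 1 to the rendezvous endpoint on
# rank 0, mirroring the telnet test; this raises an exception on failure.
with socket.create_connection(("<ip of rank 0>", 1840), timeout=5) as s:
    print("connected to", s.getpeername())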
Then I tried the GLOO backend on both, which seems to briefly establish a connection before throwing the following error:
Traceback (most recent call last):
  File "/home/noahg/universal-training-branch/exploration-enabler/multinode.py", line 112, in <module>
    main(args.save_every, args.total_epochs, args.batch_size)
  File "/home/noahg/universal-training-branch/exploration-enabler/multinode.py", line 96, in main
    ddp_setup()
  File "/home/noahg/universal-training-branch/exploration-enabler/multinode.py", line 14, in ddp_setup
    init_process_group(backend="gloo")
  File "/home/noahg/universal-training-branch/exploration-enabler/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/noahg/universal-training-branch/exploration-enabler/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/noahg/universal-training-branch/exploration-enabler/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1264, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:276] ss1.ss_family == ss2.ss_family. 2 vs 10
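For what it's worth, 2 and 10 look like the Linux values of AF_INET and AF_INET6, so this may be an IPv4 vs IPv6 mismatch between the two sides, though I'm not sure. A quick check I can run on both machines (my own sketch; <ip of rank 0> again stands in for the address I pass to --rdzv-endpoint):

import socket

# Print the address families that the rendezvous endpoint and this
# machine's own hostname resolve to. On Linux, AF_INET == 2 (IPv4) and
# AF_INET6 == 10 (IPv6), matching the "2 vs 10" in the Gloo error above.
for host in ("<ip of rank 0>", socket.gethostname()):
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(host, 1840):
        print(host, family.name, sockaddr)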
I have not been able to find any additional info on this exact RuntimeError for GLOO. Any help would be greatly appreciated. I can provide additional info if needed.
Thanks