Hi, I'm stuck and could use a hint on distributed.
I’m trying to run a distributed model on a two-GPU machine. Until recently this worked, but now it hangs at the first barrier call.
The NCCL_DEBUG output (below, with a few of my own debug prints mixed in) shows “Init COMPLETE” for both ranks but then stops (it will time out eventually).
Any hint would be appreciated!
The full code in question is this (invoked by running test_fsdp.py from the same directory).
thunder/tests/distributed/test_fsdp.py::FSDPTest::test_fsdp_broadcast_from #####initializing with file:///tmp/tmpil2bmukt #### 0 ###### 2 ##### /tmp/tmpil2bmukt
#####initializing with file:///tmp/tmpil2bmukt #### 1 ###### 2 ##### /tmp/tmpil2bmukt
####returned init
####returned init
###call barrier rank 0
###call barrier rank 1
mackay:2172718:2172718 [0] NCCL INFO Bootstrap : Using enp4s0:192.168.1.1<0>
mackay:2172718:2172718 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
mackay:2172718:2172718 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
mackay:2172718:2172718 [0] NCCL INFO NET/Plugin: Using internal network plugin.
mackay:2172718:2172718 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.21.5+cuda12.2
mackay:2172719:2172719 [1] NCCL INFO cudaDriverVersion 12030
mackay:2172719:2172719 [1] NCCL INFO Bootstrap : Using enp4s0:192.168.1.1<0>
mackay:2172719:2172719 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
mackay:2172719:2172719 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
mackay:2172719:2172719 [1] NCCL INFO NET/Plugin: Using internal network plugin.
mackay:2172718:2172799 [0] NCCL INFO NET/IB : No device found.
mackay:2172718:2172799 [0] NCCL INFO NET/Socket : Using [0]enp4s0:192.168.1.1<0> [1]enp5s0:192.168.178.37<0>
mackay:2172718:2172799 [0] NCCL INFO Using non-device net plugin version 0
mackay:2172718:2172799 [0] NCCL INFO Using network Socket
mackay:2172719:2172800 [1] NCCL INFO NET/IB : No device found.
mackay:2172719:2172800 [1] NCCL INFO NET/Socket : Using [0]enp4s0:192.168.1.1<0> [1]enp5s0:192.168.178.37<0>
mackay:2172719:2172800 [1] NCCL INFO Using non-device net plugin version 0
mackay:2172719:2172800 [1] NCCL INFO Using network Socket
mackay:2172718:2172799 [0] NCCL INFO ncclCommInitRank comm 0xa761300 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId a000 commId 0x42b80bb4bcb10b00 - Init START
mackay:2172719:2172800 [1] NCCL INFO ncclCommInitRank comm 0x94f4b30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId b000 commId 0x42b80bb4bcb10b00 - Init START
mackay:2172719:2172800 [1] NCCL INFO comm 0x94f4b30 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
mackay:2172718:2172799 [0] NCCL INFO comm 0xa761300 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
mackay:2172719:2172800 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
mackay:2172718:2172799 [0] NCCL INFO Channel 00/02 : 0 1
mackay:2172719:2172800 [1] NCCL INFO P2P Chunksize set to 131072
mackay:2172718:2172799 [0] NCCL INFO Channel 01/02 : 0 1
mackay:2172718:2172799 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
mackay:2172718:2172799 [0] NCCL INFO P2P Chunksize set to 131072
mackay:2172718:2172799 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
mackay:2172719:2172800 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
mackay:2172719:2172800 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
mackay:2172718:2172799 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
mackay:2172719:2172800 [1] NCCL INFO Connected all rings
mackay:2172719:2172800 [1] NCCL INFO Connected all trees
mackay:2172719:2172800 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
mackay:2172718:2172799 [0] NCCL INFO Connected all rings
mackay:2172719:2172800 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mackay:2172718:2172799 [0] NCCL INFO Connected all trees
mackay:2172718:2172799 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
mackay:2172718:2172799 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mackay:2172719:2172800 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
mackay:2172719:2172800 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
mackay:2172719:2172800 [1] NCCL INFO ncclCommInitRank comm 0x94f4b30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId b000 commId 0x42b80bb4bcb10b00 - Init COMPLETE
mackay:2172718:2172799 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
mackay:2172718:2172799 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
mackay:2172718:2172799 [0] NCCL INFO ncclCommInitRank comm 0xa761300 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId a000 commId 0x42b80bb4bcb10b00 - Init COMPLETE
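To double-check that I am reading the log correctly, here is a small sketch that scans NCCL_DEBUG output for the ncclCommInitRank lines and reports which ranks logged "Init START" and "Init COMPLETE". The helper names (init_phases, ranks_missing_complete) are hypothetical and my own, and the regex assumes the line format seen in the log above; it is not part of NCCL or PyTorch.

```python
import re
from collections import defaultdict

# Assumption: init lines look like the ones in the log above, e.g.
# "... ncclCommInitRank comm 0x... rank 0 nranks 2 ... - Init COMPLETE"
INIT_RE = re.compile(r"ncclCommInitRank .*?\brank (\d+) .*- Init (START|COMPLETE)")


def init_phases(log_text):
    """Return {rank: set of init phases seen} for every ncclCommInitRank line."""
    phases = defaultdict(set)
    for line in log_text.splitlines():
        match = INIT_RE.search(line)
        if match:
            phases[int(match.group(1))].add(match.group(2))
    return dict(phases)


def ranks_missing_complete(log_text, world_size):
    """List the ranks in range(world_size) that never logged 'Init COMPLETE'."""
    phases = init_phases(log_text)
    return [r for r in range(world_size) if "COMPLETE" not in phases.get(r, set())]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_nccl_init.py nccl.log 2
    with open(sys.argv[1]) as f:
        text = f.read()
    print(ranks_missing_complete(text, int(sys.argv[2])))
```

For the log above with a world size of 2, this reports no missing ranks, which matches what I see by eye: both communicators finish initialization, and the hang happens afterwards, inside the barrier itself.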