Distributed stuck in first call to barrier

Hi, I am stuck and could use a hint on distributed.

I’m trying to run a distributed model on a two-GPU machine. Until recently, that seemed to work, but it has stopped. Now it is hanging at the first barrier call.
The NCCL_DEBUG output (below, with a few of my own debug prints mixed in) shows “Init COMPLETE” on both ranks but then stops (it will time out eventually).

Any hint would be appreciated!

The full code in question is this (invoked by running test_fsdp.py from the same directory):

thunder/tests/distributed/test_fsdp.py::FSDPTest::test_fsdp_broadcast_from #####initializing with file:///tmp/tmpil2bmukt #### 0 ###### 2 ##### /tmp/tmpil2bmukt
#####initializing with file:///tmp/tmpil2bmukt #### 1 ###### 2 ##### /tmp/tmpil2bmukt
####returned init
####returned init
###call barrier rank 0
###call barrier rank 1
mackay:2172718:2172718 [0] NCCL INFO Bootstrap : Using enp4s0:192.168.1.1<0>
mackay:2172718:2172718 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
mackay:2172718:2172718 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
mackay:2172718:2172718 [0] NCCL INFO NET/Plugin: Using internal network plugin.
mackay:2172718:2172718 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.21.5+cuda12.2
mackay:2172719:2172719 [1] NCCL INFO cudaDriverVersion 12030
mackay:2172719:2172719 [1] NCCL INFO Bootstrap : Using enp4s0:192.168.1.1<0>
mackay:2172719:2172719 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
mackay:2172719:2172719 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
mackay:2172719:2172719 [1] NCCL INFO NET/Plugin: Using internal network plugin.
mackay:2172718:2172799 [0] NCCL INFO NET/IB : No device found.
mackay:2172718:2172799 [0] NCCL INFO NET/Socket : Using [0]enp4s0:192.168.1.1<0> [1]enp5s0:192.168.178.37<0>
mackay:2172718:2172799 [0] NCCL INFO Using non-device net plugin version 0
mackay:2172718:2172799 [0] NCCL INFO Using network Socket
mackay:2172719:2172800 [1] NCCL INFO NET/IB : No device found.
mackay:2172719:2172800 [1] NCCL INFO NET/Socket : Using [0]enp4s0:192.168.1.1<0> [1]enp5s0:192.168.178.37<0>
mackay:2172719:2172800 [1] NCCL INFO Using non-device net plugin version 0
mackay:2172719:2172800 [1] NCCL INFO Using network Socket
mackay:2172718:2172799 [0] NCCL INFO ncclCommInitRank comm 0xa761300 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId a000 commId 0x42b80bb4bcb10b00 - Init START
mackay:2172719:2172800 [1] NCCL INFO ncclCommInitRank comm 0x94f4b30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId b000 commId 0x42b80bb4bcb10b00 - Init START
mackay:2172719:2172800 [1] NCCL INFO comm 0x94f4b30 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
mackay:2172718:2172799 [0] NCCL INFO comm 0xa761300 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
mackay:2172719:2172800 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
mackay:2172718:2172799 [0] NCCL INFO Channel 00/02 :    0   1
mackay:2172719:2172800 [1] NCCL INFO P2P Chunksize set to 131072
mackay:2172718:2172799 [0] NCCL INFO Channel 01/02 :    0   1
mackay:2172718:2172799 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
mackay:2172718:2172799 [0] NCCL INFO P2P Chunksize set to 131072
mackay:2172718:2172799 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
mackay:2172719:2172800 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
mackay:2172719:2172800 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
mackay:2172718:2172799 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
mackay:2172719:2172800 [1] NCCL INFO Connected all rings
mackay:2172719:2172800 [1] NCCL INFO Connected all trees
mackay:2172719:2172800 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
mackay:2172718:2172799 [0] NCCL INFO Connected all rings
mackay:2172719:2172800 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mackay:2172718:2172799 [0] NCCL INFO Connected all trees
mackay:2172718:2172799 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
mackay:2172718:2172799 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
mackay:2172719:2172800 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
mackay:2172719:2172800 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
mackay:2172719:2172800 [1] NCCL INFO ncclCommInitRank comm 0x94f4b30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId b000 commId 0x42b80bb4bcb10b00 - Init COMPLETE
mackay:2172718:2172799 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
mackay:2172718:2172799 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
mackay:2172718:2172799 [0] NCCL INFO ncclCommInitRank comm 0xa761300 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId a000 commId 0x42b80bb4bcb10b00 - Init COMPLETE
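
For longer logs, a small script can summarize which ranks reached “Init COMPLETE” and which transport each rank's channels use. The following is only a sketch: the function name `summarize_nccl_log` and the regular expressions are my own and are matched against the line formats shown in this particular log, so they may need adjusting for other NCCL versions.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   host:pid:tid [dev] NCCL INFO ncclCommInitRank comm 0x... rank 0 nranks 2 ... - Init COMPLETE
INIT_RE = re.compile(
    r"NCCL INFO ncclCommInitRank comm \S+ rank (\d+) nranks (\d+).* - Init (START|COMPLETE)"
)
# Matches lines such as:
#   host:pid:tid [dev] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel \S+ : (\d+)\[\w+\] -> (\d+)\[\w+\] via (\S+)"
)


def summarize_nccl_log(text):
    """Return {rank: {"states": set, "transports": set}} from NCCL_DEBUG=INFO output."""
    summary = defaultdict(lambda: {"states": set(), "transports": set()})
    for line in text.splitlines():
        match = INIT_RE.search(line)
        if match:
            # Record whether this rank logged Init START and/or Init COMPLETE.
            summary[int(match.group(1))]["states"].add(match.group(3))
            continue
        match = CHANNEL_RE.search(line)
        if match:
            # Record the transport used by the sending rank of this channel.
            summary[int(match.group(1))]["transports"].add(match.group(3))
    return dict(summary)
```

Applied to the log above, this would report START and COMPLETE for both ranks, with all channels going via P2P/IPC. That suggests, but does not prove, that the rendezvous itself succeeded and the hang is in the first collective issued by the barrier.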

Do you have any sense of when it stopped working in case it helps us bisect the issue? Were there any changes in NCCL version?

Unfortunately, I don’t know. It was some time during the last two months, and I’m running a self-compiled PyTorch, so that would make for quite a bisection task. :frowning:
I wonder if there is a way to see what NCCL is waiting for (something like “Waiting for X to happen at network address Y”), so that I would know where to look to find out why it isn’t happening…
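
One partial way to get more visibility, sketched below, is to turn up the logging and have Python dump its stack traces while the process is hung. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS` and `TORCH_DISTRIBUTED_DEBUG` are existing environment variables and `faulthandler` is part of the standard library; the two helper function names are made up for this example. This only shows where each rank is stuck on the Python side, not which peer or event NCCL is waiting for.

```python
import faulthandler
import os
import sys


def enable_hang_diagnostics(dump_after_seconds=60.0, file=None):
    """Request verbose NCCL/PyTorch logging and periodic stack dumps while hung."""
    # These must be set before the process group (and NCCL communicator) is created.
    # setdefault keeps any value the user has already exported.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL,P2P")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    # If the process is still running after the timeout, write every thread's
    # Python stack to the given file, and repeat so a long hang stays visible.
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(dump_after_seconds, repeat=True, file=target)


def disable_hang_diagnostics():
    """Cancel the pending stack dump once the suspect call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

The idea would be to call `enable_hang_diagnostics()` at the top of each worker, before `init_process_group`, and `disable_hang_diagnostics()` after the barrier returns. If the barrier hangs, each rank prints its stack every 60 seconds until the distributed timeout fires.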

So what I did was grab a Docker container in which I knew distributed worked. When that failed too, it was clear that the driver was the problem, since the container brings its own PyTorch and NCCL but still uses the host's NVIDIA driver. Downgrading the NVIDIA driver helped.
Of course, it would be nice if there were a proper error message somewhere when NCCL doesn’t like the driver, but I guess that’s not a PyTorch thing.
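
For anyone hitting the same thing, it may help to record the driver version next to the CUDA version NCCL was built against when comparing a working and a non-working setup. Below is a minimal sketch; the function names are my own, and it assumes `nvidia-smi` is on the PATH (it returns None otherwise). It does not tell you which driver versions are affected, only what is installed.

```python
import shutil
import subprocess


def decode_cuda_version(value):
    """Split CUDA's integer version encoding (1000*major + 10*minor) into (major, minor)."""
    return value // 1000, (value % 1000) // 10


def nvidia_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    # One line per GPU; all GPUs share the same driver, so take the first.
    versions = result.stdout.split()
    return versions[0] if versions else None


# 12030 is the cudaDriverVersion value from the NCCL log in the first post.
print("CUDA driver API: %d.%d" % decode_cuda_version(12030))
print("NVIDIA driver:", nvidia_driver_version())
```

In the log from the first post, the driver reports a newer CUDA version (cudaDriverVersion 12030) than the one NCCL was built with (2.21.5+cuda12.2). That is normally a supported combination, so the version numbers alone would not have pointed at the driver.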
