NCCL backend hangs for single-node multi-GPU training

Hello all,

I am running the multi_gpu.py example for distributed training on two GPUs that sit in the same Ubuntu 20.04 Linux machine. I haven’t modified the code at all. When I execute the file with the nccl backend, the code hangs while constructing the DDP wrapper (roughly the point marked in the sketch below).
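
For context, this is roughly the shape of the script (a simplified sketch in my own words, with a toy model standing in for the tutorial’s; not the exact file):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_setup(rank: int, world_size: int):
    # single node: rendezvous over localhost
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

def main(rank: int, world_size: int):
    ddp_setup(rank, world_size)
    model = torch.nn.Linear(20, 1).to(rank)
    model = DDP(model, device_ids=[rank])   # <-- hangs here with the nccl backend
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 2 on this machine
    mp.spawn(main, args=(world_size,), nprocs=world_size)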

I have tried pretty much everything suggested on the PyTorch forums and in GitHub issues, with no luck.

I would appreciate any help in resolving this.

NOTE: With the gloo backend everything runs without any hangs; the only change is the backend string, as shown below.
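
The gloo variant is literally the same script as the sketch above with one line changed:

# identical code otherwise; this version completes training for me
dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)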

Environment details:

pytorch                   2.4.0           py3.9_cuda11.8_cudnn9.1.0_0    pytorch
pytorch-cuda              11.8                 h7e8668a_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                2.4.0                py39_cu118    pytorch
torchtriton               3.0.0                      py39    pytorch
torchvision               0.19.0               py39_cu118    pytorch
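
The same versions can be confirmed from Python (a quick sketch, not part of the example; the NCCL version matches what the log below reports):

import torch

print(torch.__version__)          # 2.4.0
print(torch.version.cuda)         # 11.8
print(torch.cuda.nccl.version())  # (2, 20, 5), per the "NCCL version 2.20.5+cuda11.8" line below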

Details about the GPUs:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "NVIDIA RTX A5000"
  CUDA Driver Version / Runtime Version          12.4 / 11.8
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24027 MBytes (25193807872 bytes)
  (064) Multiprocessors, (128) CUDA Cores/MP:    8192 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA RTX A5000"
  CUDA Driver Version / Runtime Version          12.4 / 11.8
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24027 MBytes (25193807872 bytes)
  (064) Multiprocessors, (128) CUDA Cores/MP:    8192 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA RTX A5000 (GPU0) -> NVIDIA RTX A5000 (GPU1) : Yes
> Peer access from NVIDIA RTX A5000 (GPU1) -> NVIDIA RTX A5000 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 11.8, NumDevs = 2
Result = PASS
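
Peer access can also be queried from PyTorch directly; a small sanity-check sketch (not from the tutorial), which should mirror the two peer-access lines above:

import torch

# deviceQuery reports peer access "Yes" in both directions, so both calls should return True
print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))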

Error stack trace:

ddp$ python3 multi.py 50 10

Starting trainer on device ID: 1

Starting trainer on device ID: 0

machine01:124359:124359 [0] NCCL INFO Bootstrap : Using eno1:10.131.27.116<0>

machine01:124359:124359 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation

machine01:124359:124359 [0] NCCL INFO cudaDriverVersion 12040

NCCL version 2.20.5+cuda11.8

machine01:124359:124359 [0] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x7fb60f800000

machine01:124360:124360 [1] NCCL INFO cudaDriverVersion 12040

machine01:124360:124360 [1] NCCL INFO Bootstrap : Using eno1:10.131.27.116<0>

machine01:124360:124360 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation

machine01:124360:124360 [1] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x7f5f79800000

machine01:124360:124432 [1] NCCL INFO NET/IB : No device found.

machine01:124360:124432 [1] NCCL INFO NET/Socket : Using [0]eno1:10.131.27.116<0>

machine01:124360:124432 [1] NCCL INFO Using non-device net plugin version 0

machine01:124360:124432 [1] NCCL INFO Using network Socket

machine01:124359:124431 [0] NCCL INFO NET/IB : No device found.

machine01:124359:124431 [0] NCCL INFO NET/Socket : Using [0]eno1:10.131.27.116<0>

machine01:124359:124431 [0] NCCL INFO Using non-device net plugin version 0

machine01:124359:124431 [0] NCCL INFO Using network Socket

machine01:124360:124432 [1] NCCL INFO comm 0x5623f924e240 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 81000 commId 0x81bbd591fef00891 - Init START

machine01:124359:124431 [0] NCCL INFO comm 0x5583e9baa540 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x81bbd591fef00891 - Init START

machine01:124360:124432 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eno1'

machine01:124360:124432 [1] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===

machine01:124360:124432 [1] NCCL INFO CPU/0 (1/2/-1)

machine01:124360:124432 [1] NCCL INFO + PCI[24.0] - GPU/1000 (0)

machine01:124360:124432 [1] NCCL INFO + PCI[24.0] - GPU/81000 (1)

machine01:124360:124432 [1] NCCL INFO + PCI[0.8] - NIC/C2000

machine01:124360:124432 [1] NCCL INFO ==========================================

machine01:124360:124432 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/81000 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB)

machine01:124360:124432 [1] NCCL INFO GPU/81000 :GPU/1000 (2/24.000000/PHB) GPU/81000 (0/5000.000000/LOC) CPU/0 (1/24.000000/PHB)

...

machine01:124359:124359 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)

machine01:124360:124360 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f5f79600400 recvbuff 0x7f5f79600a00 count 8 datatype 0 op 0 root 0 comm 0x5623f924e240 [nranks=2] stream 0x5623f924e170

machine01:124359:124359 [0] NCCL INFO 16 Bytes -> Algo 1 proto 0 time 7.601000

machine01:124360:124360 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)

[rank1]:[E913 22:56:54.397891680 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.

[rank1]:[E913 22:56:54.398061382 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.

[rank1]:[E913 22:56:54.398077022 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.

[rank1]:[E913 22:56:54.398083332 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

[rank1]:[E913 22:56:54.398088912 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.

[rank0]:[E913 22:56:54.398202694 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.

[rank0]:[E913 22:56:54.398332895 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.

[rank0]:[E913 22:56:54.398342595 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.

[rank0]:[E913 22:56:54.398348925 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

[rank0]:[E913 22:56:54.398354025 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.

[rank1]:[E913 22:56:54.399640210 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.

Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6050cf5f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6051fe2f02 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6051fe9943 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6051febd2c in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #4: <unknown function> + 0xd3b55 (0x7f60b46cfb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)

frame #5: <unknown function> + 0x8609 (0x7f60bc655609 in /lib/x86_64-linux-gnu/libpthread.so.0)

frame #6: clone + 0x43 (0x7f60bc420353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'

[rank0]:[E913 22:56:54.399892952 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.

Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb6eb265f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb6ec552f02 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb6ec559943 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb6ec55bd2c in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #4: <unknown function> + 0xd3b55 (0x7fb74ec3fb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)

frame #5: <unknown function> + 0x8609 (0x7fb756bc5609 in /lib/x86_64-linux-gnu/libpthread.so.0)

frame #6: clone + 0x43 (0x7fb756990353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'

what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.

Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6050cf5f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6051fe2f02 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6051fe9943 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6051febd2c in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #4: <unknown function> + 0xd3b55 (0x7f60b46cfb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)

frame #5: <unknown function> + 0x8609 (0x7f60bc655609 in /lib/x86_64-linux-gnu/libpthread.so.0)

frame #6: clone + 0x43 (0x7f60bc420353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6050cf5f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #1: <unknown function> + 0xe52446 (0x7f6051c75446 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #2: <unknown function> + 0xd3b55 (0x7f60b46cfb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)

frame #3: <unknown function> + 0x8609 (0x7f60bc655609 in /lib/x86_64-linux-gnu/libpthread.so.0)

frame #4: clone + 0x43 (0x7f60bc420353 in /lib/x86_64-linux-gnu/libc.so.6)

what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.

Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb6eb265f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb6ec552f02 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb6ec559943 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb6ec55bd2c in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #4: <unknown function> + 0xd3b55 (0x7fb74ec3fb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)

frame #5: <unknown function> + 0x8609 (0x7fb756bc5609 in /lib/x86_64-linux-gnu/libpthread.so.0)

frame #6: clone + 0x43 (0x7fb756990353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb6eb265f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)

frame #1: <unknown function> + 0xe52446 (0x7fb6ec1e5446 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)

frame #2: <unknown function> + 0xd3b55 (0x7fb74ec3fb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)

frame #3: <unknown function> + 0x8609 (0x7fb756bc5609 in /lib/x86_64-linux-gnu/libpthread.so.0)

frame #4: clone + 0x43 (0x7fb756990353 in /lib/x86_64-linux-gnu/libc.so.6)

W0913 22:56:55.349663 140695308830528 torch/multiprocessing/spawn.py:146] Terminating process 124359 via signal SIGTERM

Traceback (most recent call last):
  File "/home/alonso/multi.py", line 120, in <module>
    mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
  File "/home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 282, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
    while not context.join():
  File "/home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 170, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
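
For reference, the NCCL INFO lines above come from running with NCCL_DEBUG=INFO. Among the suggestions from other threads, I have also been toggling the usual NCCL environment variables before spawning the workers; a sketch of how I set them (exact combinations varied, and none have resolved the hang so far):

import os

# must be set before mp.spawn() so the worker processes inherit them
os.environ["NCCL_DEBUG"] = "INFO"          # produces the NCCL INFO lines shown above
os.environ["NCCL_P2P_DISABLE"] = "1"       # fall back from PCIe peer-to-peer to staged copies
os.environ["NCCL_IB_DISABLE"] = "1"        # no InfiniBand on this box anyway ("NET/IB : No device found")
os.environ["NCCL_SOCKET_IFNAME"] = "eno1"  # pin the bootstrap/socket interface
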
@ptrblck Tagging you since you were active on other threads related to these issues.

I am facing a similar issue. I tried both the gloo and nccl backends, and training always hangs at the DDP(model) step. :frowning: