One GPU always hangs when using NCCL, but not with gloo

Will_Chi · May 31, 2023, 12:34pm

When I launch DDP (with NCCL) using torchrun --nproc_per_node=4 main.py, one GPU (sometimes two) always hangs at the beginning: its utilization is 0% while its memory usage looks normal (see GPU 2 in the image). However, when only 1, 2, or 3 GPUs are used, this issue doesn’t occur.
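To narrow down which rank stalls, a small standalone check can be launched the same way as main.py. The following is a minimal sketch, not the original main.py (which is not shown here): the file name smoke_test.py, the helper names, and the 60-second timeout are all hypothetical. It relies on the RANK, LOCAL_RANK, and WORLD_SIZE variables that torchrun exports for each worker. Depending on the PyTorch version, the timeout may only take effect for NCCL when NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1.

```python
import datetime
import os


def read_torchrun_env(env):
    """Return (rank, local_rank, world_size) from the variables torchrun sets."""
    return int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])


def smoke_test(timeout_s=60):
    """Run one all_reduce per rank and print progress so a stuck rank stands out."""
    # Imported here so the helper above stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    rank, local_rank, world_size = read_torchrun_env(os.environ)
    # Bind this process to its own GPU before any NCCL call.
    torch.cuda.set_device(local_rank)

    print(f"[rank {rank}] init_process_group ...", flush=True)
    dist.init_process_group(
        backend="nccl", timeout=datetime.timedelta(seconds=timeout_s)
    )

    print(f"[rank {rank}] all_reduce ...", flush=True)
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # default op is SUM, so every rank should end with world_size
    torch.cuda.synchronize()

    print(f"[rank {rank}] done, sum={x.item()} (expected {world_size})", flush=True)
    dist.destroy_process_group()


# Only run under torchrun, which exports LOCAL_RANK for every worker.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    smoke_test()
```

Launched with torchrun --nproc_per_node=4 smoke_test.py, the last line printed by each rank shows how far that rank got before stalling.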

My experimental environment is listed below.

GPU: 4 × Quadro RTX 6000
NVIDIA Driver: 470.86
CUDA Version: 11.4
Python: 3.6.9
PyTorch: 1.10.0
Cudatoolkit: 11.3
NCCL: 2.10.3
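
Most of the fields above can be read from within Python, which helps confirm that the NCCL and CUDA versions PyTorch actually uses match the ones expected. This is a minimal sketch: the function name collect_versions is hypothetical, the NVIDIA driver version is not covered (it is usually read from nvidia-smi), and the return type of torch.cuda.nccl.version() differs between PyTorch releases, so it is stored as-is.

```python
import platform


def collect_versions():
    """Gather version fields like those listed above; missing pieces are marked 'unavailable'."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = "unavailable"
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against; None on CPU-only builds.
    info["cuda_build"] = torch.version.cuda
    try:
        info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        # CPU-only builds may not expose NCCL at all.
        info["nccl"] = "unavailable"
    # Names of the GPUs visible to this process (empty list if none).
    info["gpus"] = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```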

I have tried export NCCL_P2P_DISABLE=1, as suggested in most similar issues, but it doesn’t work for me.
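
NCCL reads its environment variables when the communicator is created, so a setting like the one above has to be in place before torch.distributed.init_process_group runs in each worker. The following sketch shows one way to apply it from Python instead of the shell. The helper name with_nccl_settings is hypothetical, and NCCL_DEBUG=INFO is an added assumption (it is not part of the original attempt) that makes NCCL log its initialization steps, which may show where a rank stops.

```python
import os


def with_nccl_settings(env, disable_p2p=True, debug_level="INFO"):
    """Return a copy of env with NCCL variables set; the input mapping is not modified."""
    new_env = dict(env)
    if disable_p2p:
        # The setting described above.
        new_env["NCCL_P2P_DISABLE"] = "1"
    if debug_level:
        # Assumption: extra NCCL logging, not part of the original attempt.
        new_env["NCCL_DEBUG"] = debug_level
    return new_env


if __name__ == "__main__":
    # Apply to this process before torch.distributed.init_process_group is called.
    os.environ.update(with_nccl_settings(os.environ))
    print({k: v for k, v in os.environ.items() if k.startswith("NCCL_")})
```

Returning a copy rather than mutating the input keeps the helper usable for building the environment of a child process as well.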

Look forward to your help. I really appreciate it!