One GPU always gets stuck/hangs when using NCCL, but not with gloo

When I launch DDP with torchrun --nproc_per_node=4 (NCCL backend), one GPU (sometimes two) always gets stuck at the beginning: its utilization is 0% while its memory usage looks normal (see GPU 2 in the image).
However, when only 1, 2, or 3 GPUs are used, this issue doesn’t occur.

My experimental environment is listed below.

  • GPU: 4 × Quadro RTX 6000
  • NVIDIA Driver: 470.86
  • CUDA Version: 11.4
  • Python: 3.6.9
  • PyTorch: 1.10.0
  • Cudatoolkit: 11.3
  • NCCL: 2.10.3
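
In case it helps to cross-check the list above against what PyTorch itself reports, here is a small sketch. collect_versions is a hypothetical helper name, and the torch fields fall back to "unavailable" if the import or the NCCL query fails.

```python
import platform


def collect_versions():
    """Collect version strings as seen from inside the Python process."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except Exception:
        # torch is missing or failed to load; report that instead of crashing.
        info["torch"] = "unavailable"
        return info
    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None on CPU builds).
    info["cuda_build"] = str(torch.version.cuda)
    try:
        info["nccl"] = str(torch.cuda.nccl.version())
    except Exception:
        info["nccl"] = "unavailable"
    if torch.cuda.is_available():
        names = [torch.cuda.get_device_name(i)
                 for i in range(torch.cuda.device_count())]
        info["gpus"] = ", ".join(names)
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print("{}: {}".format(key, value))
```

The driver-level CUDA version shown by nvidia-smi is not covered by this sketch; it only reports what the installed PyTorch build sees.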

I have tried export NCCL_P2P_DISABLE=1, as suggested in most similar issues, but it doesn’t help in my case.
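
For reference, a minimal sketch of how the launch with this variable could be scripted. train.py is a placeholder for the actual training script, build_launch is a hypothetical helper, and NCCL_DEBUG=INFO is an extra setting not mentioned above, added on the assumption that NCCL's own log may show where the stuck rank is waiting.

```python
import os
import subprocess
import sys


def build_launch(nproc=4, script="train.py", disable_p2p=True):
    """Return (command, environment) for a torchrun launch with NCCL logging."""
    env = dict(os.environ)  # copy, so the current shell environment is untouched
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"  # the workaround tried above
    env["NCCL_DEBUG"] = "INFO"  # assumption: verbose NCCL logs help locate the hang
    cmd = ["torchrun", "--nproc_per_node={}".format(nproc), script]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_launch()
    print(" ".join(cmd))
    # Only start the job when explicitly asked to, e.g. `python launch.py --run`.
    if "--run" in sys.argv:
        sys.exit(subprocess.call(cmd, env=env))
```

Passing nproc=3 to the same helper gives the configuration that does not hang, which makes it easy to compare the two NCCL logs side by side.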

Looking forward to your help. I really appreciate it!