When I launch DDP (with the NCCL backend) using torchrun --nproc_per_node=4 main.py
one GPU (sometimes two) always gets stuck at the beginning: its utilization is 0% while its memory usage looks normal (e.g., GPU 2 in the image).
However, when only 1, 2, or 3 GPUs are used, the issue does not occur.
My experimental environment is listed below.
- GPU: 4 x Quadro RTX 6000
- NVIDIA Driver: 470.86
- CUDA Version: 11.4
- Python: 3.6.9
- PyTorch: 1.10.0
- Cudatoolkit: 11.3
- NCCL: 2.10.3
I have tried export NCCL_P2P_DISABLE=1
as suggested in most similar issues, but it did not help in my case.
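For reference, here is a minimal sketch of how that variable can be applied to the launch from Python instead of the shell. The helper names build_launch and launch are hypothetical and exist only for this illustration. NCCL_DEBUG=INFO is an extra NCCL logging variable not mentioned above; it is included on the assumption that its output may help show where the ranks stop.

```python
import os
import subprocess


def build_launch(nproc_per_node=4, script="main.py", disable_p2p=True, debug=False):
    """Return (env, argv) for a torchrun launch with the chosen NCCL settings.

    build_launch is a hypothetical helper written for this post.
    """
    env = dict(os.environ)  # copy so the current process is left unchanged
    if disable_p2p:
        # Same effect as `export NCCL_P2P_DISABLE=1` in the shell
        env["NCCL_P2P_DISABLE"] = "1"
    if debug:
        # Assumption: verbose NCCL logs are wanted for diagnosing the hang
        env["NCCL_DEBUG"] = "INFO"
    argv = ["torchrun", "--nproc_per_node={}".format(nproc_per_node), script]
    return env, argv


def launch(**kwargs):
    """Run torchrun with the environment built above (hypothetical helper)."""
    env, argv = build_launch(**kwargs)
    return subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    env, argv = build_launch(debug=True)
    # Only print the command here; call launch(debug=True) to actually run it.
    print(" ".join(argv))
```

Setting the variables in the environment passed to torchrun means every worker process inherits them, which is the same as exporting them in the shell before launching.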
I look forward to your help. I really appreciate it!