I was trying to use DDP on a Slurm cluster, but I got an error and had to export NCCL_IB_DISABLE=1 to get the program to run.
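For context, this is roughly how I'm launching the job now (a sketch, not my exact script — `train.py` and the argument values are placeholders). Setting NCCL_DEBUG=INFO makes NCCL log which transport it actually picks, which is how I'd like to confirm what's going on:

```shell
# Sketch of the current workaround; script name and GPU count are placeholders.
export NCCL_DEBUG=INFO      # log which transport NCCL selects (IB, socket, P2P, SHM)
export NCCL_IB_DISABLE=1    # the workaround that makes the program run at all
torchrun --nproc_per_node=4 train.py
```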
I’m not sure whether it has anything to do with the flag, but training seemed very slow, and I wonder if that is normal. I ran the same training procedure on two A100s (on another machine) and on four RTX 3090s (on the Slurm cluster), and found that the four RTX 3090s were actually about two times slower than the two A100s.
If the flag is the problem, is there any way to solve it? Currently I can ensure the program actually runs on a single machine (node), and from my understanding, NCCL is trying to use a network interface with the IB/RoCE protocol, which fails because there is no suitable hardware. Since all the processes will be on the same machine, is there any way to configure the Slurm cluster, or PyTorch/NCCL, so that it communicates over the local interconnects (probably PCIe, or shared memory) instead of Ethernet?
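In case it helps, this is the kind of single-node batch script I have in mind (a sketch under my assumptions — job resources and `train.py` are placeholders, and I'm assuming that with everything on one node NCCL should fall back to P2P/shared-memory transports on its own once IB is disabled):

```shell
#!/bin/bash
#SBATCH --nodes=1           # keep all processes on one node
#SBATCH --gres=gpu:4        # four RTX 3090s on this node (placeholder)

export NCCL_IB_DISABLE=1       # no usable IB/RoCE hardware on this node
export NCCL_SOCKET_IFNAME=lo   # restrict any socket traffic to loopback (assumption:
                               # intra-node data should go over P2P/SHM anyway)
export NCCL_DEBUG=INFO         # verify in the logs which transport is used

srun torchrun --standalone --nproc_per_node=4 train.py
```

My understanding (please correct me) is that for ranks on the same node, NCCL uses P2P over NVLink/PCIe or shared memory for the bulk data, and the socket interface mostly for bootstrap, so the slowdown may come from somewhere else entirely.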