Using DDP on a Slurm cluster with NCCL_IB_DISABLE=1 is very slow

I was trying to use DDP on a Slurm cluster, but I got an error and had to export NCCL_IB_DISABLE=1 to get the program to run.

I’m not sure whether it’s related to the flag, but training seemed very slow, and I wonder if that is normal. I ran the same training procedure on two A100s (a different machine) and on four RTX 3090s (on the Slurm cluster), and the four RTX 3090s were about two times slower than the two A100s.

If the flag is the problem, is there any way to solve it? I can ensure the program runs entirely on a single machine (node), and my understanding is that NCCL is trying to use a network interface and fails when it attempts to communicate over IB/RoCE, since there is no suitable hardware. Given that all the processes will be on the same machine, is there any way to configure the Slurm cluster, or PyTorch, so that NCCL uses the local interconnects (probably PCIe) to communicate instead of Ethernet?
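For reference, here is a minimal sketch of where I end up setting the flag; the Slurm variables, interface name, and rendezvous details are placeholders rather than my exact script:

```python
import os

import torch
import torch.distributed as dist

# NCCL environment variables have to be in the environment before the
# process group / NCCL communicator is created (set here in Python, or
# exported in the sbatch script).
os.environ["NCCL_IB_DISABLE"] = "1"
# os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder interface name

def init_ddp():
    # Slurm normally provides these when launching with srun; the
    # fallbacks are only so the sketch runs standalone.
    rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NTASKS", 1))
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # expects MASTER_ADDR / MASTER_PORT to be set
        rank=rank,
        world_size=world_size,
    )
```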

@kwen2501 do you have any insights regarding the NCCL_IB_DISABLE flag?

Hi, if your training is on a single node, NCCL will likely not use any network interconnect for data communication in the first place, whether your network is IB/RoCE or traditional TCP/IP. For intranode communication, NCCL tries to use NVLink, PCIe, and system shared memory, roughly in that order.
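If you want to check whether intranode communication is actually the bottleneck, one rough way is to time a plain all_reduce once the process group is up. This is only a sketch (tensor size and iteration count are arbitrary), assuming one process per GPU on the node:

```python
import time

import torch
import torch.distributed as dist

def allreduce_benchmark(num_iters: int = 20, numel: int = 64 * 1024 * 1024):
    # One float32 tensor of ~256 MB per rank; the size is arbitrary.
    x = torch.randn(numel, device="cuda")

    # Warm-up so NCCL can set up its channels before timing.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(num_iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if dist.get_rank() == 0:
        gb = x.element_size() * x.numel() * num_iters / 1e9
        print(f"all_reduce: {elapsed / num_iters * 1e3:.2f} ms/iter, "
              f"~{gb / elapsed:.1f} GB/s of data moved per rank")
```

If this runs much slower on the 3090 node than on the A100 machine, the gap is in the interconnect rather than in your training code.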

If your training does run across multiple nodes, NCCL_IB_DISABLE=1 would disable the use of IB/RoCE for internode communication and fall back to sockets. If your nodes do not have any IB devices installed, NCCL should detect that and disable IB automatically.

To fully understand the problem, you can set NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV to see how NCCL detects your system and which transports it selects. If you have more questions, you can also post to the NCCL project.
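For example, the variables can be set from Python before the process group is created (or exported in the job script); in the INFO output you would then look at which transport each channel reports, e.g. P2P/IPC, SHM, or NET:

```python
import os

# These need to be in the environment before the NCCL communicator is
# initialized (i.e. before torch.distributed.init_process_group and the
# first collective), otherwise they will not be picked up.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH,ENV"
```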
