Can DistributedDataParallel with the gloo backend use InfiniBand?

Hello everyone,

I managed to train a ResNet model with DistributedDataParallel on a cluster that uses Slurm to manage resources. The backend I’m using for DistributedDataParallel is gloo. Right now training is very slow, and throughput does not improve when I add more GPUs. Data transfer between the nodes looks like the bottleneck, because GPU utilization keeps cycling between 0% and 100%. I checked the network traffic between the nodes with netstat, and it shows that the data is going over TCP.

The cluster has InfiniBand. Is there a way to route the data transfer over InfiniBand instead?

Thanks.

It can, with some very small patches. We are working on merging them into master.

That would be great! I’ll try it out once it is available.

Hi, here is the patch: https://github.com/pytorch/pytorch/pull/2903. Feel free to try it out and let us know if you have any questions.
Alternatively, if you have IPoIB set up, it should be easier to use the IPoIB IP address directly, with no recompilation needed. That way the traffic also goes over the IB link.
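
For the IPoIB route, something along these lines might work. This is only a rough sketch under a few assumptions: the IPoIB interface is named ib0 (check with `ip addr` on your nodes), the address 10.0.0.1 and port 29500 are placeholders for the rank-0 node's IPoIB address and a free port, and your PyTorch build is recent enough to honor the GLOO_SOCKET_IFNAME environment variable. The helper names ipoib_env and init_gloo_over_ipoib are made up for illustration. Note that gloo still speaks TCP here; it just runs on top of the IB link rather than Ethernet.

```python
import os


def ipoib_env(master_ip, master_port=29500, ifname="ib0"):
    """Build the environment variables that point gloo at the IPoIB interface.

    master_ip must be the IPoIB address of the rank-0 node (placeholder value
    in the usage below); ifname is assumed to be "ib0".
    """
    return {
        "MASTER_ADDR": master_ip,
        "MASTER_PORT": str(master_port),
        # Tells gloo which network interface to bind its sockets to.
        "GLOO_SOCKET_IFNAME": ifname,
    }


def init_gloo_over_ipoib(rank, world_size, master_ip,
                         master_port=29500, ifname="ib0"):
    """Initialize the gloo process group over the IPoIB interface."""
    # Imported lazily so ipoib_env() can be used without PyTorch installed.
    import torch.distributed as dist

    os.environ.update(ipoib_env(master_ip, master_port, ifname))
    dist.init_process_group(
        backend="gloo",
        init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from the env
        rank=rank,
        world_size=world_size,
    )


if __name__ == "__main__":
    # Prints the variables only; call init_gloo_over_ipoib() in each worker.
    for key, value in ipoib_env("10.0.0.1").items():
        print(f"{key}={value}")
```

Under Slurm, each task could take its rank from SLURM_PROCID and the world size from SLURM_NTASKS and pass them to init_gloo_over_ipoib. After that, wrap the model in DistributedDataParallel as before.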