I managed to run a ResNet model using DistributedDataParallel on a cluster that uses Slurm to manage resources. The backend I’m using for DistributedDataParallel is gloo. Right now, training is very slow and the speed does not improve when I use more GPUs. It looks like data transfer between the nodes is the bottleneck, because GPU utilization cycles between 0% and 100%. I checked the network traffic between the nodes using netstat, and it shows that the data is being transferred over TCP.
The cluster has InfiniBand. Is there a way to route the data transfer between nodes over InfiniBand instead?
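For reference, here is a minimal sketch of the kind of setup described above, not my exact script. It assumes one Slurm task per GPU, that `MASTER_ADDR` and `MASTER_PORT` are already exported by the job script, and that the InfiniBand (IPoIB) interface is named `ib0`, which is only a guess and varies by cluster. The helper names `build_dist_env` and `init_distributed` are hypothetical. `GLOO_SOCKET_IFNAME` is the environment variable gloo reads to choose a network interface. As far as I understand, gloo would still speak TCP over that interface, so using InfiniBand natively would probably mean passing `backend="nccl"` instead.

```python
import os


def build_dist_env(environ, ifname="ib0"):
    """Map Slurm task variables to the names torch.distributed reads.

    `ifname` is an assumed interface name; check `ip addr` on a compute node.
    """
    return {
        "RANK": environ["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": environ["SLURM_NTASKS"],   # total number of tasks
        "LOCAL_RANK": environ["SLURM_LOCALID"],  # rank of this task on its node
        "GLOO_SOCKET_IFNAME": ifname,            # interface gloo should bind to
    }


def init_distributed(backend="gloo", ifname="ib0"):
    """Initialise the process group from Slurm variables (hypothetical helper)."""
    # Imported here so the mapping above can be checked without PyTorch installed.
    import torch.distributed as dist

    os.environ.update(build_dist_env(os.environ, ifname))
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank(), dist.get_world_size()
```

Each task launched with `srun` would call `init_distributed()` once before wrapping the model in DistributedDataParallel.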