I’m using PyTorch on a cluster connected by InfiniBand (56 Gb/s FDR).
I want to run distributed training where each process controls one GPU and the gradients are averaged across processes by ‘allreduce’ (I’m using the MPI backend). I expect this to scale well, just like MPI-based Caffe with InfiniBand support.
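For context, my setup is essentially the following (a minimal sketch; the resnet50 model comes from torchvision as an example, and the whole thing is launched with mpirun, one process per GPU):

```python
import torch
import torch.distributed as dist
import torchvision.models as models

# With the MPI backend, rank and world size come from mpirun itself,
# so no init_method or environment variables are needed.
dist.init_process_group(backend='mpi')
rank = dist.get_rank()
world_size = dist.get_world_size()

# One process per GPU: pin this process to its own device.
torch.cuda.set_device(rank % torch.cuda.device_count())
model = models.resnet50().cuda()
```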
So I built PyTorch from source with WITH_DISTRIBUTED=1, and I’m sure the MPI libraries are built with InfiniBand support (they work well with MPI-based Caffe). I expected this to run faster on 8 GPUs than ‘DataParallel’, since it bypasses the GIL. But the performance was actually worse.
After some profiling I found that the bottleneck is ‘allreduce’, which should be faster over InfiniBand. To check whether the communication is actually running through IB, I tested the point-to-point bandwidth using dist.send/recv. It’s about 3.7 GB/s, which is odd: that is more than Ethernet generally reaches, so traffic seems to be going over IB, yet it is only about half the theoretical bandwidth of InfiniBand. (I also measured bandwidth with the OSU benchmark, which shows 11 GB/s intra-node and 6 GB/s inter-node.)
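The bandwidth test was along these lines (a rough sketch, run as a separate two-process script; the 256 MB buffer size and iteration count are arbitrary choices, and timing only on the sender side is approximate):

```python
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend='mpi')
rank = dist.get_rank()

# 256 MB CPU tensor (64M float32 elements); size is an arbitrary choice.
tensor = torch.zeros(64 * 1024 * 1024)
n_iters = 20
size_gb = tensor.numel() * 4 / float(1024 ** 3)

if rank == 0:
    dist.send(tensor, 1)  # warm-up
    start = time.time()
    for _ in range(n_iters):
        dist.send(tensor, 1)
    elapsed = time.time() - start
    print('p2p bandwidth: %.2f GB/s' % (n_iters * size_gb / elapsed))
elif rank == 1:
    dist.recv(tensor, 0)  # warm-up
    for _ in range(n_iters):
        dist.recv(tensor, 0)
```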
Here are my numbers:
On Titan Xp, resnet50 with batch size 32 per GPU takes 0.15 s per iteration on 1 GPU and 0.3 s per iteration on 8 GPUs when using ‘DataParallel’.
When using 8 processes and averaging gradients by ‘allreduce’ after loss.backward(), it takes 0.45 s per iteration, which is even slower than ‘DataParallel’. About 0.3 s of that is spent in allreduce.
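Concretely, the per-iteration step looks roughly like this (continuing the sketch above; the criterion, optimizer, and dummy batch are placeholders for my real training code, and reducing GPU gradients directly like this assumes a CUDA-aware MPI build):

```python
from torch.autograd import Variable

def average_gradients(model):
    # Sum each gradient across all processes, then divide by the
    # number of processes to get the averaged gradient in place.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
            param.grad.data /= world_size

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Dummy batch just to illustrate the step; real code uses a DataLoader.
images = Variable(torch.randn(32, 3, 224, 224).cuda())
labels = Variable(torch.LongTensor(32).random_(0, 1000).cuda())

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
average_gradients(model)  # this call is where the ~0.3 s goes
optimizer.step()
```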
It seems that MPI is not working properly with PyTorch under my settings. Am I missing something? I have been searching and experimenting for 2 days but still cannot get it to work. Any help would be really appreciated!
PS: I also tried the ‘gloo’ backend with the InfiniBand patch posted here; it ran, but performed even worse (similar to the ‘tcp’ backend).