PyTorch distributed training on an extremely large GPU fleet


(Arvind Mohan) #1

I was wondering if anyone can point me to resources on how PyTorch scales on extremely large parallel jobs. I am a researcher in a group working on ML for computational physics, and all of my code is in PyTorch.

Currently we are evaluating ML frameworks that can scale to super-massive distributed jobs, i.e. > 5000 GPUs. Any idea what the largest distributed job using PyTorch has been so far (both in the public domain and internally at Facebook)? Are there any fundamental limitations that prevent it from scaling at that level? If so, can I expect any improvements with the 1.0 release?

I would like to avoid rewriting all my code in TensorFlow just for scaling, so I’d really appreciate some input. Thanks in advance…


(Teng Li) #2

For ImageNet, converged model accuracy depends entirely on the total batch size used in training. In other words, the largest ResNet batch size I have personally trained without accuracy loss is 4096 images, though others have published results with batch sizes as large as 18–24K.

So imagine: if you put a batch of 32 images on each GPU, a 4096-image global batch distributes your job across 128 GPUs. From my own experience, from a pure HPC performance and scalability point of view, I have done 64 nodes * 8 GPUs/node (512 GPUs) training myself with pretty good linear scalability. Of course, scalability depends on your model size and your HPC cluster's network setup.
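The arithmetic above is just global batch size divided by per-GPU batch size; a quick sketch in plain Python, using the numbers from this thread:

```python
# How many GPUs a given global batch spreads across, assuming a fixed
# per-GPU batch size and data parallelism (one shard per GPU).
def gpus_needed(global_batch, per_gpu_batch):
    # The global batch must split evenly across GPUs.
    assert global_batch % per_gpu_batch == 0
    return global_batch // per_gpu_batch

print(gpus_needed(4096, 32))  # → 128
print(gpus_needed(4096, 8))   # → 512, i.e. the 64 nodes * 8 GPUs/node run above
```

Note that scaling to 5000+ GPUs at a fixed global batch of 4096 would leave under one image per GPU, which is why the achievable global batch size (without accuracy loss) effectively caps useful data-parallel width.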

So what is your network interconnect, InfiniBand or Ethernet? At what speed, and what model are you trying to train?


(Arvind Mohan) #3

Thanks for your input. It's certainly interesting that you see scalability at 512 GPUs, but we are looking at really large scale, and 5000 GPUs would be the lower bound. Some facts about the hardware:

  1. IBM Power9 with NVIDIA Volta V100 GPUs
  2. GPUs have NVLink (Gen 2) interconnect, with 4 GPUs/node.
  3. Mellanox EDR InfiniBand.

I want to train a custom architecture similar to a convolutional recurrent network, and we will be training on several terabytes of data. Apart from the inherent issues of communication latency etc., is there any limitation in PyTorch itself when I scale up?

I am having difficulty finding resources (apart from the documentation) to learn distributed GPU training with PyTorch. What approach did you use? I am currently experimenting with Horovod on a much smaller scale, but I’d prefer to use PyTorch’s native distributed module, if I can find some good tutorials to study. Thanks again for your insights.