Training MobileNet on multiple GPUs is slow in PyTorch

I was trying to train MobileNet on multiple GPUs using PyTorch. Watching nvidia-smi, I can see that the GPUs are sometimes busy and sometimes idle (GPU utilization at 0%). This slows down training a lot.

However, training MobileNet on a single GPU and training ResNet50 on multiple GPUs do not have this issue. I was wondering what is going wrong. Has anyone else run into this problem?


  1. PyTorch version is 0.4.0.
  2. I read all training data into memory.
  3. I have also tried Keras, which does not have this issue.

It might be that you are seeing the overhead of scattering and gathering the whole model.
You could parallelize just the feature layers, if that’s possible.
Have a look at the ImageNet example, where this is also done.
Krizhevsky published this approach in his “One weird trick” paper.
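A minimal sketch of what that looks like, using a hypothetical toy model (`SmallNet`) standing in for MobileNet; the real ImageNet example applies the same pattern to AlexNet/VGG, and for MobileNet you would wrap the corresponding feature sub-module:

```python
import torch
import torch.nn as nn

# Hypothetical minimal model: a convolutional feature extractor
# followed by a small classifier head, mirroring the features/classifier
# split used in the ImageNet example.
class SmallNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = SmallNet()

# Wrap ONLY the feature layers in DataParallel. The classifier stays on a
# single device, so each iteration scatters/gathers just the small feature
# activations instead of replicating and syncing the entire model.
model.features = nn.DataParallel(model.features)

x = torch.randn(4, 3, 32, 32)
if torch.cuda.is_available():
    model.cuda()
    x = x.cuda()

out = model(x)  # falls back to single-device execution when no GPU is present
```

Whether this helps depends on the architecture: MobileNet’s depthwise convolutions launch many small kernels, so per-batch replication overhead in `DataParallel` can dominate, which would explain why ResNet50 (fewer, larger kernels) scales fine while MobileNet stalls.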