Debugging DataParallel, no speedup and uneven memory allocation

In the backward pass of DataParallel, we reduce the gradients from GPU2 onto GPU1.

Our DataParallel algorithm is roughly like this (a code sketch follows the two lists):

in forward:

  • scatter mini-batch to GPU1, GPU2
  • replicate model on GPU2 (it is already on GPU1)
  • model_gpu1(input_gpu1), model_gpu2(input_gpu2) (this step is parallel_apply)
  • gather output mini-batch from GPU1, GPU2 onto GPU1

in backward:

  • scatter grad_output and input
  • parallel_apply model’s backward pass
  • reduce GPU2 replica’s gradients onto GPU1 model
  • now there is only a single model again, with accumulated gradients from GPU1 and GPU2
  • gather the grad_input
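
To make the forward half of that recipe concrete, here is a rough sketch using the scatter / replicate / parallel_apply / gather primitives from torch.nn.parallel (an illustration, not the exact implementation); the backward steps then fall out of autograd, because all four primitives are differentiable:

```python
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

def data_parallel_forward(module, input, device_ids, output_device=None):
    """Rough equivalent of DataParallel's forward for a single input tensor."""
    if output_device is None:
        output_device = device_ids[0]
    # scatter the mini-batch to GPU1, GPU2, ...
    inputs = scatter(input, device_ids)
    # replicate the model onto every device (it already lives on device_ids[0])
    replicas = replicate(module, device_ids[:len(inputs)])
    # run each replica on its chunk of the batch, in parallel
    outputs = parallel_apply(replicas, inputs)
    # gather the output mini-batch back onto the output device
    return gather(outputs, output_device)
```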

Hence, unlike in Chainer, you do not actually need a separate trainer that is aware of DataParallel.
Hope this makes it clear.
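
In other words, you wrap the model once and keep the ordinary single-GPU training loop. A minimal sketch (resnet18, a dummy batch, and the hyperparameters here are just stand-ins for your own model and data):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# wrap the model once; nothing else in the loop has to know about DataParallel
model = nn.DataParallel(models.resnet18().cuda(), device_ids=[0, 1])
criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# dummy batch, only to show the shape of the loop
input = torch.randn(64, 3, 224, 224).cuda()
target = torch.randint(0, 1000, (64,)).cuda()

output = model(input)       # scatter / replicate / parallel_apply / gather happen here
loss = criterion(output, target)
optimizer.zero_grad()
loss.backward()             # replica gradients are reduced onto the GPU1 copy
optimizer.step()            # a single optimizer step updates the one model
```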

As for why your model is slower with DataParallel: you have 61 million parameters, so I presume you have some Linear layers (i.e. fully connected layers) at the end. Put them outside the purview of DataParallel to avoid having to broadcast / reduce those parameter weights and gradients on every iteration. Here are two examples of doing that:

https://github.com/pytorch/examples/blob/master/imagenet/main.py#L68
https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py

When training AlexNet or VGG, we only put model.features in DataParallel, and not the whole model itself, because AlexNet and VGG have large Linear layers at the end of the network.
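
Concretely, the pattern from those links looks like the sketch below (using torchvision's vgg16 as a stand-in for your model):

```python
import torch.nn as nn
import torchvision.models as models

model = models.vgg16()
# data-parallelize only the convolutional part; the large Linear layers in
# model.classifier stay on a single GPU, so their weights and gradients are
# never broadcast / reduced across devices
model.features = nn.DataParallel(model.features)
model.cuda()
```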

Maybe your situation is similar?
