Debugging DataParallel, no speedup and uneven memory allocation

You expect your epoch to become twice as fast with two GPUs. With perfect weak scaling the time per minibatch stays constant, but you have half as many minibatches per epoch (if the dataset consists of 140 samples, that is 10 minibatches at a minibatch size of 14, but only 5 minibatches at a minibatch size of 28).

Oh, my bad. I got confused because I have always been used to looking at per-minibatch times. Sorry for the misunderstanding, @mattmacy, I apologize.

But as @ngimel and @apaszke pointed out, there are many scenarios in which DataParallel is not great, especially if you have too little compute or if you have too many parameters in your model.


@ngimel thank you! Yes, when I go from 340 minibatches to 170 minibatches I expect the wall-clock time to drop from 22 minutes to 11 minutes. Instead I’m seeing no change.

@smth I think I’m missing something. DataParallel only splits up the forward pass, so I would think that in order to see a proper speedup during training I’d need to encapsulate the whole training loop, so that the gradients from the backward pass of the loss were averaged and then applied to the weights. This snippet shows the logic I’m referring to: http://docs.chainer.org/en/latest/tutorial/gpu.html#data-parallel-computation-on-multiple-gpus-without-trainer Is there some equivalent to their ParallelUpdater that I’m overlooking?

Otherwise, I have 61,989,982 parameters in my model. I guess that’s too many parameters? Why would DataParallel be rate-limited by that? And is it possible I’m doing something wrong such that I’m ending up with more parameters than I mean to have? It’s just a 5-level encoder-decoder FCN with 1-3 convolutions at each level and a skip connection from the output of each encoder level to the decoder level with the same resolution and number of channels. See the diagram from the paper https://github.com/mattmacy/vnet.pytorch/blob/master/images/diagram.png if my description doesn’t make sense.

Thank you for your time.

DataParallel also distributes the backward pass; it is hidden in autograd. DataParallel has to broadcast and reduce all the parameters, so parallelization efficiency decreases when your computation time is small and you have a lot of parameters.
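As a rough back-of-the-envelope number (assuming float32 parameters and a PCIe 3.0 x16 link, not something measured on your machine): 62 million parameters are about 248 MB that have to be broadcast to the replica, plus roughly the same amount of gradients that have to be reduced back, on every iteration. At a practical 10-12 GB/s that is on the order of 20-25 ms in each direction per iteration, so unless a minibatch’s compute takes considerably longer than that, the communication eats most of the gain from the second GPU.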


In the backward pass of DataParallel, we reduce the gradients from GPU2 onto GPU1.

Our DataParallel algorithm is roughly like this:

in forward:

  • scatter mini-batch to GPU1, GPU2
  • replicate model on GPU2 (it is already on GPU1)
  • model_gpu1(input_gpu1), model_gpu2(input_gpu2) (this step is parallel_apply)
  • gather output mini-batch from GPU1, GPU2 onto GPU1

in backward:

  • scatter grad_output and input
  • parallel_apply model’s backward pass
  • reduce GPU2 replica’s gradients onto GPU1 model
  • now there is only a single model again, with accumulated gradients from GPU1 and GPU2
  • gather the grad_input

Hence, unlike in Chainer, you do not actually have to have a separate trainer that is aware of DataParallel.
Hope this makes it clear.
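
For concreteness, here is a rough sketch (an illustration, not the actual library code) of what the forward side of this looks like in terms of the public torch.nn.parallel primitives; the backward steps fall out of autograd because scatter, replicate, parallel_apply and gather are all differentiable:

```python
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

def data_parallel_forward(module, batch, device_ids=(0, 1), output_device=0):
    # 1. scatter: split the mini-batch along dim 0 and copy the chunks to the GPUs
    inputs = scatter(batch, device_ids)
    # 2. replicate: copy the module (assumed to already live on device_ids[0]) to every GPU
    replicas = replicate(module, device_ids)[:len(inputs)]
    # 3. parallel_apply: run each replica on its own chunk in parallel
    outputs = parallel_apply(replicas, [(chunk,) for chunk in inputs])
    # 4. gather: concatenate the per-GPU outputs back onto the output device
    return gather(outputs, output_device)
```

In practice you just wrap your model in `nn.DataParallel` (or call `nn.parallel.data_parallel`) and use it like a normal module; the sketch is only meant to make the four steps explicit.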

wrt why your model is slower via DataParallel, you have 61 million parameters. So, I presume you have some Linear layers at the end (i.e. fully connected layers). Put them outside the purview of DataParallel to avoid having to distribute / reduce those parameter weights and gradients. Here is an example of doing that:

https://github.com/pytorch/examples/blob/master/imagenet/main.py#L68
https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py

When training AlexNet or VGG, we only put model.features in DataParallel, and not the whole model itself, because AlexNet and VGG have large Linear layers at the end of the network.

Maybe your situation is similar?
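
For example, following the pattern in those two links (a sketch, assuming a torchvision AlexNet; adapt the attribute name to your own model):

```python
import torch.nn as nn
import torchvision.models as models

model = models.alexnet()
# parallelize only the convolutional trunk; the large Linear classifier stays on
# a single GPU, so its weights and gradients are never broadcast/reduced each iteration
model.features = nn.DataParallel(model.features)
model.cuda()
```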


Is it necessary to replicate all of the gradients? Couldn’t you just replicate the output of the backward pass of just the loss function and then average the results?

There are no fully connected layers. I guess 3D convolutions with >= 128 channels have an exorbitant number of parameters. I may have made an error in going up to 512 channels when I only meant to go up to 256, so at least you’ve prompted me to take a closer look at the model.
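
(For a rough check: assuming a plain 5x5x5 Conv3d mapping 128 channels to 128 channels, the parameter count is 128 × 128 × 5³ + 128 for the bias = 2,048,128, and at 256 channels it is 256 × 256 × 5³ + 256 = 8,192,256, which matches the largest entries in the list below, so a handful of wide 5x5x5 layers accounts for most of the 62 million.)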

Thanks again.

This is the result of printing the number of parameters for each of the basic elements (a sketch of the counting code follows the list):
Conv3d: 2016
BatchNorm3d: 32
Conv3d: 4128
BatchNorm3d: 64
Conv3d: 128032
BatchNorm3d: 64
Conv3d: 16448
BatchNorm3d: 128
Conv3d: 512064
BatchNorm3d: 128
Conv3d: 512064
BatchNorm3d: 128
Conv3d: 65664
BatchNorm3d: 256
Conv3d: 2048128
BatchNorm3d: 256
Conv3d: 2048128
BatchNorm3d: 256
Conv3d: 2048128
BatchNorm3d: 256
Conv3d: 262400
BatchNorm3d: 512
Conv3d: 8192256
BatchNorm3d: 512
Conv3d: 8192256
BatchNorm3d: 512
Conv3d: 8192256
BatchNorm3d: 512
ConvTranspose3d: 262272
BatchNorm3d: 256
Conv3d: 8192256
BatchNorm3d: 512
Conv3d: 8192256
BatchNorm3d: 512
Conv3d: 8192256
BatchNorm3d: 512
ConvTranspose3d: 131136
BatchNorm3d: 128
Conv3d: 2048128
BatchNorm3d: 256
Conv3d: 2048128
BatchNorm3d: 256
ConvTranspose3d: 32800
BatchNorm3d: 64
Conv3d: 512064
BatchNorm3d: 128
ConvTranspose3d: 8208
BatchNorm3d: 32
Conv3d: 128032
BatchNorm3d: 64
Conv3d: 8002
BatchNorm3d: 4
Conv3d: 6
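
For reference, a minimal sketch (not the exact code used above) of how per-module counts like these can be printed, assuming `model` is the network instance:

```python
def print_param_counts(model):
    # walk the module tree and report parameter counts for leaf modules only
    for m in model.modules():
        if len(list(m.children())) == 0:
            n = sum(p.numel() for p in m.parameters())
            if n > 0:
                print('{}: {}'.format(m.__class__.__name__, n))
```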

@smth Revisiting it, the channel split numbers are in fact correct. I tried replacing each of the 5x5x5 filters with two 3x3x3 filters, which reduced the parameter count to 27 million, but it actually increased the memory consumption on the GPU and provided no speedup. So I guess I’ll just have to stick with using the additional GPU for hyperparameter search.

Thanks.

@smth I tried removing some of the largest convolutional layers - with no ill effect - and now DataParallel epochs are taking 15 minutes. Thanks for the explanations!


sweet, that’s great news.

Soumith and Adam, I am having a great time exploring PyTorch! Thanks for the awesome library.

I am trying to saturate a 64-core/256-thread CPU in addition to the GPUs. Any pointers on how I can extend Data_parallel.py to create 3 scatters on GPU0, GPU1, and CPU(0-255)?

With Keras I modified this script to saturate the CPU:

@FuriouslyCurious So you want to run the model in parallel on two GPUs and all cores? We don’t have any utility for that, and I don’t think it’s even worth it :confused: The code will be more complex and you’ll probably see hardly any speedup.


Given the statements in this thread, is there no speedup expected for classic DNNs, @apaszke?
I currently run speaker-recognition DNNs with PyTorch, and increasing the number of GPUs used (e.g. to 2) while at the same time doubling the batch size (256 -> 512) does not affect the training time at all. A full epoch on Switchboard takes me ~300 minutes with a single GPU, and the same with 2 GPUs and double the batch size. The model is a 6-layer, 2048-node DNN. The 2 GPUs are utilized, i.e. they are not idle at all.

So for DNNs no speedup is expected if the number of model parameters is large?

@ngimel @smth I’m running into a situation where using DataParallel ends up training slower than without it on one GPU (keeping everything else constant). If I try to increase the batch size, I run out of memory. I’m running an encoder-decoder (with attention) model with 3 million parameters. When running on one GPU, I’m able to use a batch size of 2048 sequences, which takes up about 6000 MiB out of the 6078 MiB. Whereas on two GPUs (using DataParallel on all layers in the encoder and decoder), running the same batch size takes up 6070 MiB on GPU 1 and only 1022 MiB on GPU 2.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:06:00.0     Off |                  N/A |
|  0%   51C    P2   167W / 300W |   6070MiB /  6078MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980 Ti  Off  | 0000:07:00.0      On |                  N/A |
|  0%   45C    P2    93W / 300W |   1021MiB /  6075MiB |     35%      Default |
+-------------------------------+----------------------+----------------------+

I have tried putting the linear layers outside of DataParallel, to no avail - the machine ran out of memory.

I understand that the computations on GPU 1 require more memory than those on GPU 2, but I was expecting more of the memory on GPU 2 to be used. Am I at maximum capacity / performance? Is there anything I can do with this seq-to-seq model to train more batches at a time or to shorten the training time?

PCI-e communication latency, transfer overheads, and weight sync on CPU mean that small models don’t benefit from multi-GPU training.

Can you try large image models like ResNet training from examples repo and check if they saturate both GPUs?
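
If it helps, here is a rough sketch (not the examples-repo script itself; it assumes torchvision is installed and uses ResNet-50 on random data) of a quick saturation check you can run while watching nvidia-smi:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# train a large conv net on synthetic data under DataParallel and watch GPU utilization
model = nn.DataParallel(models.resnet50()).cuda()
criterion = nn.CrossEntropyLoss().cuda()
images = torch.randn(64, 3, 224, 224).cuda()
targets = torch.randint(0, 1000, (64,)).cuda()

for _ in range(100):
    loss = criterion(model(images), targets)
    model.zero_grad()
    loss.backward()
```

If both GPUs stay busy here but not with your seq-to-seq model, the bottleneck is most likely the parameter broadcast/reduce relative to the model’s compute.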

Hi @bottanski, I also observed this. Have you made any progress on it? Thank you.

@bottanski @magic282 I am observing the same. The model is actually not sped up at all in my scenario.

Hi!

I am also observing that using DataParallel() is not providing a speedup when using multiple GPUs. In our case we are using a deepspeech implementation in PyTorch.

For the 5-layer bi-directional GRU (38 million parameters, most of which are in the GRU layers) it takes 18 minutes and 6 seconds on 2 GPUs and 18 minutes and 34 seconds on 1 GPU (70 epochs on the smaller an4 dataset). For 1 GPU we used a batch size of 20, whereas for 2 GPUs we used a batch size of 40.

Are there any suggestions on how to get faster times with multiple GPUs?

I can provide more data or run more experiments if that is helpful!

To @smth @ngimel @apaszke @mattmacy

I wonder if you could help me straighten out my confusion about PyTorch multi-GPU acceleration:

Here is what @smth mentioned about how PyTorch training works in multi-GPU mode:

in forward:

  • gather output mini-batch from GPU1, GPU2 onto GPU1

in backward:

  • scatter grad_output and input

However, my personal thought is that in a backpropagation algorithm running on N GPUs, after you finish the N forward calculations on the individual GPUs, you do not have to gather the outputs onto GPU 0, calculate the loss on GPU 0, and then scatter grad_output and the input back to the N GPUs before the subsequent backward calculation. I am confused about why you do not do the following instead, which might speed up PyTorch multi-GPU training:
in a combined forward + backward pass:

  • scatter mini-batch to GPU1, GPU2
  • replicate model on GPU2 (it is already on GPU1)
  • model_gpu1(input_gpu1), model_gpu2(input_gpu2) (this step is parallel_apply)
  • compute the loss on each GPU, then parallel_apply the model’s backward pass
  • reduce GPU2 replica’s gradients onto GPU1 model

I have read the code here, and I understand that it works as @smth mentioned.
It seems to me that the forward function there has a fixed structure of ‘scatter - calculate - gather’ (or ‘map then reduce’); it cannot do ‘scatter - forward calculate - backward calculate - gather’. So the current PyTorch 0.4.0 does not allow the procedure I mentioned above, which I think might further accelerate multi-GPU training. Am I wrong about that?
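
For what it’s worth, one common workaround that gets close to this with the existing DataParallel (a sketch, not an official API; the class below is just illustrative) is to move the loss computation inside the wrapped module, so each replica computes the loss on its own shard and only the small per-replica loss values are gathered onto GPU 0:

```python
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Wrap a model and its criterion so each DataParallel replica computes its own loss."""
    def __init__(self, model, criterion):
        super(ModelWithLoss, self).__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # return a 1-element tensor so the per-replica losses can be gathered along dim 0
        return self.criterion(outputs, targets).unsqueeze(0)

# usage sketch:
#   wrapped = nn.DataParallel(ModelWithLoss(model, nn.CrossEntropyLoss())).cuda()
#   loss = wrapped(inputs, targets).mean()   # average the per-GPU losses
#   loss.backward()                          # gradients are still reduced onto GPU 0
```

The parameter broadcast and gradient reduction still happen, but the full network outputs no longer have to be gathered onto a single GPU before the loss is computed.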

P.S.: I looked into this because I noticed that PyTorch multi-GPU acceleration does not work very well for me. E.g. with 2 GPUs I only get a 5%-30% speedup when trying different model topologies.

Any thoughts? :slight_smile:


Here is an alternative data-parallel implementation: http://hangzh.com/PyTorch-Encoding/_modules/encoding/parallel.html#ModelDataParallel . I read it and tried its code with a small net, but I get garbage in the backward step.

Hello,
I am curious about the implementation of gradient accumulation, but I can’t find the code that does this job. Can you give a hint?
Thanks!