Debugging DataParallel, no speedup and uneven memory allocation

PCI-e communication latency, transfer overheads, and weight sync on CPU mean that small models don’t benefit from multi-GPU training.

Can you try training a large image model like ResNet from the examples repo and check whether it saturates both GPUs?

Hi @bottanski, I also observed this. Have you made any progress on it? Thank you.

@bottanski @magic282 I am observing the same. In my scenario the model is not sped up at all.


I am also observing that using DataParallel() is not providing speedup when using multiple GPUs. In our case we are using a deepspeech implementation in PyTorch.

For the 5-layer bi-directional GRU (38 million parameters, most of which are in the GRU layers) it takes 18 minutes and 6 seconds on 2 GPUs and 18 minutes and 34 seconds on 1 GPU (70 epochs on the smaller an4 dataset). For 1 GPU we used a batch size of 20, whereas for 2 GPUs we used a batch size of 40.
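To put those two timings in perspective, here is a quick back-of-the-envelope calculation. It only uses the numbers quoted above; defining "parallel efficiency" as speedup divided by GPU count is my own choice of metric, not something measured separately.

```python
# Wall-clock times quoted above for 70 epochs on an4.
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

t_1gpu = to_seconds(18, 34)  # 1 GPU, batch size 20
t_2gpu = to_seconds(18, 6)   # 2 GPUs, batch size 40

n_gpus = 2
speedup = t_1gpu / t_2gpu
# Assumed metric: ideal scaling on n_gpus would give efficiency 1.0.
efficiency = speedup / n_gpus

print(f"speedup: {speedup:.3f}x, parallel efficiency: {efficiency:.1%}")
```

So the second GPU buys under 3% in wall-clock time, which is about half of what ideal scaling would deliver.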

Are there any suggestions on how to get faster times with multiple GPUs?

I can provide more data or run more experiments if that is helpful!

To @smth @ngimel @apaszke @mattmacy

I wonder if you could help me sort out my confusion about the PyTorch Multi-GPU acceleration issue:

Here is what @smth mentioned about how PyTorch training works in Multi-GPU mode:

in forward:

gather output mini-batch from GPU1, GPU2 onto GPU1

in backward:

scatter grad_output and input

However, my personal thought is that:
in a backpropagation algorithm running on N GPUs, once each GPU has finished its forward calculation, you should not have to gather the outputs onto GPU 0, calculate the loss on GPU 0, and then scatter the grad_output and input back to the N GPUs before the subsequent backward calculation. I am confused about why you do not do the following instead, which might speed up PyTorch Multi-GPU training.
in forward:

scatter mini-batch to GPU1, GPU2
replicate model on GPU2 (it is already on GPU1)
model_gpu1(input_gpu1), model_gpu2(input_gpu2) (this step is parallel_apply)
parallel_apply model’s backward pass
reduce GPU2 replica’s gradients onto GPU1 model

I have read the code here, and I understand that it works as @smth mentioned.
It seems to me that the forward function here has a fixed structure of ‘scatter-calculate-gather’ (or ‘map-then-reduce’). It cannot do ‘scatter-forward calculate-backward calculate-gather’. So the current PyTorch 0.4.0 does not allow the procedure I described above, which I think might further accelerate PyTorch. Am I wrong about that?
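To make the idea concrete, here is a toy, framework-free sketch of it. This is not PyTorch code and not how DataParallel is implemented: the scalar linear model, the sum-of-squares loss, and the helper names (`scatter`, `data_parallel_grad`) are all made up for illustration, and the "replicas" are just list slices processed one after another. It only shows that if each replica computes its own loss and backward pass on its shard and the gradients are then summed, you get the same gradient as on the full mini-batch, without gathering the outputs first.

```python
def grad_sum_squared_error(w, b, xs, ts):
    """Gradient of sum((w*x + b - t)**2) with respect to (w, b)."""
    gw = sum(2.0 * (w * x + b - t) * x for x, t in zip(xs, ts))
    gb = sum(2.0 * (w * x + b - t) for x, t in zip(xs, ts))
    return gw, gb

def scatter(items, n_replicas):
    """Split a mini-batch into contiguous shards, one per replica."""
    size = (len(items) + n_replicas - 1) // n_replicas
    return [items[i:i + size] for i in range(0, len(items), size)]

def data_parallel_grad(w, b, xs, ts, n_replicas=2):
    """Forward + backward per shard, then reduce (sum) the gradients."""
    x_shards = scatter(xs, n_replicas)
    t_shards = scatter(ts, n_replicas)
    # Each replica holds the same (w, b) and sees only its own shard.
    per_replica = [
        grad_sum_squared_error(w, b, x_shard, t_shard)
        for x_shard, t_shard in zip(x_shards, t_shards)
    ]
    # Reduce step: add every replica's gradient into one result.
    gw = sum(g[0] for g in per_replica)
    gb = sum(g[1] for g in per_replica)
    return gw, gb

xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]
full = grad_sum_squared_error(0.5, 0.0, xs, ts)
sharded = data_parallel_grad(0.5, 0.0, xs, ts, n_replicas=2)
print(full, sharded)
```

Note the equality relies on the loss being a plain sum over samples. With a mean-reduced loss the per-shard gradients would need to be weighted by shard size before being added.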

P.S.: I looked into this because I noticed that PyTorch Multi-GPU acceleration does not work very well for me. E.g. with 2 GPUs, I only get a 5%-30% speedup across the different model topologies I tried.

Any thoughts? :slight_smile:


Here is a solution for data parallel. I read it and tried its code on a small net, but I get garbage in the backward step.

I am curious about how the gradients are accumulated, but I can’t find the code that does this. Can you give me a hint?
Thanks!