Debugging DataParallel, no speedup and uneven memory allocation


(Furiously Curious) #23

PCI-e communication latency, data transfer overhead, and weight synchronization on the CPU mean that small models don't benefit from multi-GPU training.

Can you try training a large image model like ResNet from the examples repo and check whether it saturates both GPUs?
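Something along these lines (a minimal sketch with synthetic data; the model choice, batch size, and iteration count are arbitrary placeholders, not the actual examples script) should keep both GPUs busy if DataParallel is working as expected; watch nvidia-smi while it runs:

```python
import time

import torch
import torch.nn as nn
import torchvision.models as models

# A large conv net, so per-GPU compute dominates the scatter/replicate/gather overhead.
model = nn.DataParallel(models.resnet50(num_classes=1000).cuda())
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Synthetic data is enough for a saturation check.
inputs = torch.randn(128, 3, 224, 224).cuda()
targets = torch.randint(0, 1000, (128,)).cuda()

torch.cuda.synchronize()
start = time.time()
for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()
print(f"{(time.time() - start) / 50:.3f} s per iteration")
```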


#24

Hi @bottanski, I have also observed this. Have you made any progress on it? Thank you.


(Chenyang Huang) #25

@bottanski @magic282 I am observing the same thing, and in my case the model is not sped up at all.


(Udit Gupta) #26

Hi!

I am also observing that DataParallel() provides no speedup when using multiple GPUs. In our case we are using a DeepSpeech implementation in PyTorch.

For a 5-layer bidirectional GRU (38 million parameters, most of which are in the GRU layers), training for 70 epochs on the smaller AN4 dataset takes 18 minutes and 6 seconds on 2 GPUs versus 18 minutes and 34 seconds on 1 GPU. For 1 GPU we used a batch size of 20, whereas for 2 GPUs we used a batch size of 40.
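For reference, this is roughly how we wrap the model and scale the batch size with the GPU count (a minimal sketch, not the actual deepspeech.pytorch training code; the GRU layer sizes here are placeholders):

```python
import torch
import torch.nn as nn

# Stand-in for the DeepSpeech network; the layer sizes here are placeholders.
model = nn.GRU(input_size=161, hidden_size=800, num_layers=5,
               bidirectional=True, batch_first=True).cuda()

device_count = torch.cuda.device_count()
batch_size = 20 * device_count  # 20 on 1 GPU, 40 on 2 GPUs

if device_count > 1:
    # DataParallel scatters each batch of 40 into two chunks of 20, runs one
    # replica per GPU, then gathers the outputs and reduces gradients on
    # GPU 0 every iteration.
    model = nn.DataParallel(model)
```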

Are there any suggestions on how to get faster times with multiple GPUs?

I can provide more data or run more experiments if that is helpful!