Debugging DataParallel, no speedup and uneven memory allocation

Hi!

I am also observing that DataParallel() does not provide a speedup when using multiple GPUs. In our case we are using a DeepSpeech implementation in PyTorch.

For a 5-layer bi-directional GRU (38 million parameters, most of which are in the GRU layers), training for 70 epochs on the smaller AN4 dataset takes 18 minutes 6 seconds on 2 GPUs versus 18 minutes 34 seconds on 1 GPU. We used a batch size of 20 for 1 GPU and a batch size of 40 for 2 GPUs.
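For reference, this is roughly how the model is wrapped; it is a minimal sketch, not the actual DeepSpeech code, and the layer sizes, feature dimension, and sequence length are placeholders I chose for illustration:

```python
import torch
import torch.nn as nn

class GRUStack(nn.Module):
    """Stand-in for the 5-layer bi-directional GRU (sizes are illustrative)."""
    def __init__(self, input_size=161, hidden_size=800, num_layers=5):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers=num_layers,
                          bidirectional=True, batch_first=True)

    def forward(self, x):
        out, _ = self.gru(x)
        return out

model = GRUStack()
if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch along dim 0, so a batch of 40
    # becomes 20 per GPU on 2 GPUs (matching the single-GPU batch size).
    model = nn.DataParallel(model)
model = model.cuda()

# Dummy batch: (batch, time, features) -- shapes are placeholders.
batch = torch.randn(40, 200, 161).cuda()
output = model(batch)
```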

Are there any suggestions on how to get faster training times with multiple GPUs?

I can provide more data or run more experiments if that is helpful!