Why does DataParallel work poorly for CNNs compared to LSTMs?


(Alexis W) #1

As the title says: is there something about CNNs that is hard to parallelize?


(Alexis W) #2

Any ideas for this? Is there something wrong with my code?


#3

Could you explain a bit more what you are experiencing?
Is your training slow using a CNN?
How did you time the training?
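For reference, CUDA kernels are launched asynchronously, so a timing loop usually needs to synchronize before reading the clock; otherwise the time can be attributed to the wrong line. A minimal sketch, assuming a toy CNN and random data (the helper name time_steps, the model, and the shapes are hypothetical, not taken from your code):

```python
import time
import torch
import torch.nn as nn

def time_steps(model, data, target, optimizer, criterion, n_steps=50):
    """Return wall-clock seconds for n_steps training iterations."""
    use_cuda = next(model.parameters()).is_cuda
    if use_cuda:
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical toy CNN and random inputs, only to make the sketch runnable.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 30 * 30, 10)
).to(device)
data = torch.randn(16, 3, 32, 32, device=device)
target = torch.randint(0, 10, (16,), device=device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

n_steps = 5
elapsed = time_steps(model, data, target, optimizer, nn.CrossEntropyLoss(), n_steps)
print(f"{elapsed / n_steps:.4f} s per step")
```

Timing the single-GPU and the DataParallel run with the same helper and the same per-GPU batch size should make the two numbers directly comparable.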


(Alexis W) #4

With a CNN, training on multiple GPUs takes roughly the same time as training on a single GPU. With LSTMs, however, there is a difference. Is that normal? Thanks!


#5

The time for each iteration or for each epoch?
In the former case, this would be a perfectly linear speedup, assuming the batch size was scaled up with the number of GPUs.
In the latter case, the bottleneck might be your data loading, e.g. loading from an HDD instead of an SSD.

Are you using the same DataLoader for the CNN and LSTM run?


(Alexis W) #6

I checked the time spent in the loop header, i.e. something like for i in data_loader, and that is pretty fast. The majority of the time was spent in the steps result = model(data) and optimizer.step(), so I am not sure what happened. It does not seem to be a data loader issue.

I tracked the time for 50 steps, so I think it is close to your latter case.


#7

So the 50 steps using multiple GPUs take the same time as 50 steps using a single GPU, e.g. 1 minute?
Assuming you’ve scaled up your batch size for DataParallel, this would be perfectly fine, as the wall time for an epoch will now be divided by the number of GPUs.
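The arithmetic behind this can be sketched in a few lines. The dataset size below is a made-up number, and 1.2 s per step just corresponds to the "50 steps in 1 minute" example; the function name epoch_seconds is hypothetical as well:

```python
import math

def epoch_seconds(num_samples, batch_size, seconds_per_step):
    """Wall time for one pass over the dataset at a fixed cost per step."""
    steps = math.ceil(num_samples / batch_size)
    return steps * seconds_per_step

# Assumed numbers: 51,200 samples, 1.2 s per step in both setups.
single_gpu = epoch_seconds(51_200, batch_size=128, seconds_per_step=1.2)
four_gpus = epoch_seconds(51_200, batch_size=128 * 4, seconds_per_step=1.2)

print(f"1 GPU:  {single_gpu:.0f} s per epoch")
print(f"4 GPUs: {four_gpus:.0f} s per epoch")
print(f"speedup: {single_gpu / four_gpus:.1f}x")
```

If the step time stays the same while each step processes four times as many samples, the epoch needs a quarter of the steps.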


(Alexis W) #8

Yes, roughly.
Should I scale the batch size up? I am wondering whether a too-large batch size would lead to bad performance (say 128 -> 1024 with 8 GPUs).


#9

Your data will be split across the devices by chunking in the batch dimension.
If your single-GPU model worked well with a batch size of e.g. 128, you could use a batch size of 128*4 for 4 GPUs.
Each model replica will get a batch of 128 samples, so the performance should not change that much.
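To make the split concrete, here is a small CPU-only sketch. torch.chunk only mimics the split along the batch dimension; the input shape and the device count are assumptions for illustration:

```python
import torch

num_gpus = 4          # assumed device count
per_gpu_batch = 128   # batch size that worked on a single GPU

# Hypothetical image batch of shape [512, 3, 32, 32].
batch = torch.randn(per_gpu_batch * num_gpus, 3, 32, 32)

# Split along dim 0 (the batch dimension), one chunk per device.
chunks = torch.chunk(batch, num_gpus, dim=0)
for i, c in enumerate(chunks):
    print(f"replica {i}: {tuple(c.shape)}")
```

Each chunk keeps the non-batch dimensions unchanged, so every replica sees inputs of the same shape as in the single-GPU run.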


(Alexis W) #10

Okay. Sorry, I am still a little confused here. You said that each model will get 128 samples, but how will the training work at the backward step? Will it be something like taking the sum of the gradients from each GPU?


#11

Have a look at this explanation of the general parallel operations.
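In short, during the backward pass the gradients from each replica are summed into the original module. As a rough CPU-only illustration of that idea (this emulates the replicas with deepcopy and is not DataParallel's actual code path; the tiny linear model and the data are hypothetical):

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)  # hypothetical tiny model
x, y = torch.randn(8, 10), torch.randn(8, 1)

# Reference: gradient of the mean loss over the full batch.
nn.functional.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Emulate two replicas, each seeing half of the batch.
summed = torch.zeros_like(full_grad)
for xc, yc in zip(x.chunk(2), y.chunk(2)):
    replica = copy.deepcopy(model)
    replica.zero_grad()
    # Divide by the full batch size so the chunk losses add up to the mean loss.
    loss = nn.functional.mse_loss(replica(xc), yc, reduction="sum") / x.shape[0]
    loss.backward()
    summed += replica.weight.grad  # sum the per-replica gradients

print(torch.allclose(full_grad, summed, atol=1e-6))
```

With DataParallel the outputs are gathered on one device and the loss is computed there on the whole batch, so you do not have to do this scaling yourself.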