Why does DataParallel work poorly for CNNs compared to LSTMs?

AlexisW · September 16, 2018, 12:13am

As the title says: is there something about CNNs that is hard to parallelize?

AlexisW · September 16, 2018, 6:26pm

Any ideas for this? Is there something wrong with my code?

ptrblck · September 16, 2018, 6:54pm

Could you explain a bit more about what you are experiencing?
Is your training slow using a CNN?
How did you time the training?

AlexisW · September 16, 2018, 7:17pm

The time for training a CNN model on multiple GPUs is roughly the same as training on a single GPU, but for LSTMs there is a difference. Is that normal? Thanks!

ptrblck · September 16, 2018, 7:21pm

Do you mean the time for each iteration or for each epoch?
In the former case, this would be a perfectly linear speedup, assuming the batch size was scaled with the number of GPUs.
In the latter case, the bottleneck might be your data loading, e.g. loading from an HDD instead of an SSD.

Are you using the same DataLoader for the CNN and LSTM run?

AlexisW · September 16, 2018, 8:37pm

I checked the time taken by the loop for i in data_loader, and that is pretty fast. The majority of the time was spent in the steps result = model(data) and optimizer.step(), so I am not sure what happened. It does not seem to be a data loader issue.

I tracked the time for 50 steps, so I think it is close to your latter case.
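
One caveat when timing steps like model(data) on a GPU: CUDA kernels are launched asynchronously, so a timer can stop before the GPU has actually finished. Below is a minimal sketch of a timing helper. The names time_steps, step_fn, and sync are hypothetical and not from this thread; step_fn stands for one training step, and with PyTorch on CUDA you could pass torch.cuda.synchronize as sync.

```python
import time


def time_steps(step_fn, n_steps=50, sync=None):
    """Return the wall time in seconds for n_steps calls of step_fn.

    step_fn and sync are hypothetical placeholders: step_fn runs one
    training step, and sync blocks until queued GPU work has finished
    (e.g. torch.cuda.synchronize when training with PyTorch on CUDA).
    """
    if sync is not None:
        sync()  # finish work queued before the timer starts
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    if sync is not None:
        sync()  # wait for the last step to complete before stopping the timer
    return time.perf_counter() - start


# CPU-only stand-in for a training step, just to show the call
calls = []
elapsed = time_steps(lambda: calls.append(1), n_steps=50)
print(f"50 steps took {elapsed:.6f} s")
```

Timing the single-GPU and multi-GPU runs with the same helper and the same number of steps keeps the comparison fair.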

ptrblck · September 16, 2018, 8:41pm

So the 50 steps using multiple GPUs take the same time as 50 steps using a single GPU, e.g. 1 minute?
Assuming you’ve scaled up your batch size for DataParallel this would be perfectly fine, as your wall time for an epoch will be divided by your number of GPUs now.
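
The arithmetic behind this can be sketched as follows. The dataset size is a made-up number, the 1 minute per 50 steps comes from the example above, and epoch_seconds is a hypothetical helper used only for illustration.

```python
import math


def epoch_seconds(dataset_size, batch_size, seconds_per_step):
    """Estimate the wall time per epoch from the time of a single step."""
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return steps_per_epoch * seconds_per_step


# Assumed numbers: 51,200 samples, and 50 steps take 1 minute in both setups
seconds_per_step = 60 / 50

single_gpu = epoch_seconds(51_200, batch_size=128, seconds_per_step=seconds_per_step)
four_gpus = epoch_seconds(51_200, batch_size=128 * 4, seconds_per_step=seconds_per_step)

speedup = single_gpu / four_gpus
print(f"single GPU: {single_gpu:.0f} s, 4 GPUs: {four_gpus:.0f} s, speedup: {speedup:.1f}x")
```

The step time is unchanged, but each step now covers four times as many samples, so the epoch needs a quarter of the steps.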

AlexisW · September 16, 2018, 8:43pm

Yes, roughly.
Should I scale the batch size up? I am wondering whether too large a batch size would lead to bad performance (say 128 -> 1024 with 8 GPUs).

ptrblck · September 16, 2018, 8:46pm

Your data will be split across the devices by chunking in the batch dimension.
If your single model worked well for a batch size of e.g. 128, you could use a batch size of 128*4 for 4 GPUs.
Each model replica will get a batch of 128 samples, so the performance should not change that much.
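
As a rough illustration of the chunking, here is a plain-Python sketch of the per-device batch sizes. It assumes the split follows torch.chunk-style semantics (equal chunks of ceil(batch_size / num_gpus) samples, with a possibly smaller last chunk); chunk_sizes is a hypothetical helper, not a PyTorch function.

```python
def chunk_sizes(batch_size, num_gpus):
    """Per-device batch sizes when splitting along the batch dimension.

    Assumes torch.chunk-style splitting: every chunk holds
    ceil(batch_size / num_gpus) samples except possibly the last one,
    and fewer chunks are produced if the batch is too small.
    """
    chunk = -(-batch_size // num_gpus)  # ceiling division
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(chunk_sizes(128 * 4, 4))  # [128, 128, 128, 128]
print(chunk_sizes(128, 4))      # [32, 32, 32, 32]
print(chunk_sizes(130, 4))      # [33, 33, 33, 31]
```

The second call shows what happens if the batch size is left at 128: each GPU only sees 32 samples per step.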

AlexisW · September 16, 2018, 8:48pm

Okay. Sorry, I am still a little bit confused here. You said that each model will get 128 samples, but how will the training work at the backward step? Will that be something like summing the gradients from each GPU?

ptrblck · September 16, 2018, 10:07pm

Have a look at this explanation of the general parallel operations.
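
In short, during the backward pass the gradients from each replica are summed into the original model's parameters. A small numeric sketch of why this matches single-device training, assuming the loss is averaged over the full batch: the one-parameter model, the data, and the helper grad_w are all made up for illustration, and this is plain Python rather than DataParallel's actual implementation.

```python
import math


def grad_w(w, xs, ts, n_total):
    """Gradient w.r.t. w of (1 / n_total) * sum((w * x - t) ** 2) over the given samples."""
    return sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts)) / n_total


# Toy data for a model y = w * x (hypothetical values)
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ts = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
n = len(xs)

# Gradient computed on the whole batch at once (single device)
full_grad = grad_w(w, xs, ts, n)

# Gradient contributions computed per chunk (one chunk per "GPU"), then summed
num_gpus = 4
per_gpu = n // num_gpus
chunk_grads = [
    grad_w(w, xs[i:i + per_gpu], ts[i:i + per_gpu], n)
    for i in range(0, n, per_gpu)
]
summed_grad = sum(chunk_grads)

print(full_grad, summed_grad)
print(math.isclose(full_grad, summed_grad))
```

Because the loss is a sum over samples (scaled by the full batch size), the per-chunk contributions add up to the full-batch gradient, so the optimizer step sees the same gradient either way.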