AlexNet training on multi-GPU is not faster

Hi all,

I am trying to measure the speedup which can be obtained in alexnet training by exploiting multiple GPUs.
I am using this code https://github.com/pytorch/examples/blob/master/imagenet/main.py without any modification. The training set is taken from ImageNet (Large Scale Visual Recognition Challenge 2012) and is composed of 100 classes with 1000 images. The batch size is 1024 and the number of epochs is 10. I am running the same code with the same parameters and inputs on Microsoft Azure NV6 (1 x NVIDIA Tesla M60) and NV12 (2 x NVIDIA Tesla M60). The installed PyTorch is 0.3.1 (with CUDA 9.0).

I would expect the code to run faster on 2 GPUs (not 2x, but at least a significant speedup); instead, I am obtaining the same execution time of around 3500 seconds.

Is there any parameter I am not setting correctly, or something I have to modify, in order to actually exploit all the available GPUs?

Thanks


Have you doubled your batch size?
In this example you can see that the batch will be split among the GPUs.
If you keep the batch size at 1024, each GPU only gets 512 samples, and thus your training can't run any faster than with one GPU.
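To make the splitting concrete, here is a minimal sketch (the module and tensor names are just illustrative; on 0.3.x you would additionally wrap the input in torch.autograd.Variable). nn.DataParallel scatters the batch along dimension 0, so with two GPUs each replica processes 512 of the 1024 samples:

```python
import torch
import torch.nn as nn

# Minimal illustration: DataParallel scatters the input along dim 0,
# so each of the 2 GPUs sees 512 of the 1024 samples per forward pass.
model = nn.DataParallel(nn.Linear(256, 10)).cuda()

inputs = torch.randn(1024, 256).cuda()  # full batch on the default GPU
outputs = model(inputs)                 # replicas run on GPU 0 and GPU 1
print(outputs.shape)                    # torch.Size([1024, 10]), gathered on GPU 0
```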

No, I didn’t double the batch size.
But since each GPU takes half of the batch at each iteration, each iteration should be faster, and since the number of iterations remains the same, the whole process should be faster.

Yeah, you have a point.
Have you tried to time the processing of the different batch sizes on one GPU? It would be interesting to see what the speedup should be.
Don’t forget to synchronize the GPU calls before stopping the timer.
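For reference, a rough timing pattern with explicit synchronization could look like this (model and inputs stand for whatever you are benchmarking; the synchronization calls are the standard torch.cuda API):

```python
import time
import torch

# CUDA kernels are launched asynchronously, so synchronize before
# reading the clock on both sides of the measured region.
torch.cuda.synchronize()
start = time.time()

output = model(inputs)    # forward pass of the model under test
loss = output.sum()
loss.backward()           # backward pass

torch.cuda.synchronize()  # wait until all queued GPU work has finished
print('iteration time: %.3f s' % (time.time() - start))
```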

I did some tests on a single GPU and the results were as expected (e.g., if I reduce the batch size and the number of input images so that the number of iterations remains the same, then the execution time is smaller).
The execution time in all the tests is collected outside PyTorch, so all the copy/synchronization issues should be taken into account.

Are you using exactly the same code as mentioned in your first post or have you modified anything?

Maybe, since AlexNet has large fully connected layers, synchronization is actually a bigger bottleneck than computation speed, and that's why it doesn't get faster with more GPUs?

I thought the same and wanted to test it myself.
Did you have any experience with that?
I'm currently traveling and my Internet connection is quite slow, which makes working remotely quite painful.

I am using exactly the code from https://github.com/pytorch/examples/blob/master/imagenet/main.py
Moreover, according to the thread Debugging DataParallel, no speedup and uneven memory allocation and https://github.com/pytorch/examples/blob/master/imagenet/main.py#L81, the fully connected layers of AlexNet are already kept on a single GPU to avoid synchronization issues.
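The relevant part of the linked script looks roughly like this (paraphrased, so check the linked line for the exact code): for AlexNet and VGG only model.features is wrapped in DataParallel, while the large classifier stays on a single GPU.

```python
# Roughly the logic around main.py#L81: for AlexNet/VGG only the
# convolutional part is data-parallel; the large FC classifier is not.
if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
    model.features = torch.nn.DataParallel(model.features)
    model.cuda()
else:
    model = torch.nn.DataParallel(model).cuda()
```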

I've used that script to train AlexNet before. I found that it's such a simple model that it's very easy to be limited by the CPU and/or disk access, and the number of GPUs is often irrelevant.

You can try increasing the number of data-loading workers (e.g. -j16), or you could try a larger model (e.g. --arch=resnet50) if you want to check whether multi-GPU speedups are possible.
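As a sketch of what -j controls (based on the DataLoader call in that script; train_dataset stands for the ImageNet dataset built earlier in main.py), the flag maps to the num_workers argument of the DataLoader:

```python
import torch.utils.data

# The -j flag of main.py sets num_workers; more worker processes help
# keep the GPUs fed when JPEG decoding / disk access is the bottleneck.
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=1024, shuffle=True,
    num_workers=16, pin_memory=True)
```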