Training speed on Single GPU vs Multi-GPUs

sungtae · June 20, 2018, 3:37am

Dear members,

While practicing to get familiar with multi-GPU training, I found that there is a difference not only in running time but also in training speed, i.e. the number of epochs needed to reach the same level of loss/accuracy.

I trained a custom, yet simple, CNN model on a custom dataset for a 6-class classification problem, once with a single GPU and once with two GPUs.
Everything is the same except for

model = nn.DataParallel(model)

in the multi-GPU run.

The attached figure contains loss (NLL loss after log-softmax) and accuracy plots over epochs.
The ORANGE line is for single-GPU training and the BLUE line is for multi-GPU training.
I was intentionally overfitting the training set to check that my draft model was actually being trained, so you do not need to worry about the overfitting shown in the plots.

The point is that training with multiple GPUs clearly converges faster than with a single GPU.
I am wondering whether this is the expected result, since I thought that DataParallel just splits the batch into chunks fed to each GPU, and that the gradients from the backward pass are just collected and summed/averaged together.
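
To make that expectation concrete, here is a minimal sketch in plain Python (not PyTorch, and not what DataParallel literally executes). It assumes a hypothetical one-parameter linear model with a mean squared error loss, and the data values are made up for illustration. With two equal-sized chunks, averaging the per-chunk gradients gives the same value as the full-batch gradient, so batch splitting alone should not change how fast the loss goes down.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical toy data and starting weight (illustration only).
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# "Single GPU": one gradient over the whole batch.
full_grad = grad_mse(w, xs, ys)

# "Two GPUs": split the batch into two equal chunks, compute one
# gradient per chunk, then average the two results.
half = len(xs) // 2
chunk_grads = [
    grad_mse(w, xs[:half], ys[:half]),
    grad_mse(w, xs[half:], ys[half:]),
]
averaged_grad = sum(chunk_grads) / len(chunk_grads)

print(full_grad, averaged_grad)
```

The equality only holds when the chunks have the same size and every layer treats each sample independently of the others in its chunk.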

Thanks!

enisberk · October 31, 2018, 12:24am

I think this is interesting. Did you find out the reason behind these results?

Felix_Lessange · April 8, 2019, 1:37pm

With two GPUs and the same batch size, your batch is divided in two, so the batch size seen by each GPU is half of the one seen by your single GPU in your first experiment. Since batch norm statistics are computed separately on each GPU, this means that the batch size seen by batch norm layers is also divided by two. This could explain the difference.
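
As a rough illustration of this point, the sketch below uses plain Python with made-up activation values for a single channel (an assumption, not data from the original experiment). It compares the mean and biased variance computed over a full batch with the ones computed over each half, which is what a batch norm layer on each GPU would normalize with.

```python
from statistics import mean, pvariance

# Hypothetical activations of one channel for a batch of 6 samples.
batch = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# "Single GPU": batch norm normalizes with statistics of the whole batch.
full_stats = (mean(batch), pvariance(batch))

# "Two GPUs": each replica only sees its own half of the batch,
# so each one normalizes with its own mean and variance.
half = len(batch) // 2
chunk_stats = [
    (mean(chunk), pvariance(chunk))
    for chunk in (batch[:half], batch[half:])
]

print("full batch:", full_stats)
print("per chunk: ", chunk_stats)
```

In this toy case the per-chunk means and variances are very different from the full-batch ones, so the normalized activations (and therefore the gradients) differ between the two setups. How large the effect is in a real run depends on the data and the batch size.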

James_Condon · July 5, 2019, 5:02am

Hi @sungtae,

Thanks for this. What is your batch size? What size are your images? And how ‘difficult’ are your dataset and your classes?

How many num_workers are you using for the dataloader?

I’m interested because I’m getting unexpected times per epoch for large (medical) images: with the same batch size, iterating over a range of num_workers values on different GPUs, 1 GPU is a lot quicker per epoch than 2 GPUs.

Cheers.