Does batch size = 6 on 3 GPUs give a worse result than a single GPU with batch size = 4?


I am trying to train an SSD (VGG-based) for facial detection.
Then I noticed a problem:
when I train the model on a single GPU with batch size = 4, I get a better result than training on 3 GPUs with batch size = 6.

The learning rate settings and the training iterations are exactly the same in both runs.

However, I expected the 3-GPU run to get a better result because of the larger batch size.
Can someone help me to find out the reason?

I believe you are right, a higher batch size should give a better result. However, a higher batch size also decreases the number of optimization steps per epoch.

So try increasing the number of optimization steps / epochs by 50% and tell us how the results fare then.

Or do you possibly have batch norm layers in your model? If you do, a batch size of 6 on 3 GPUs means each GPU gets only 2 images -> not so nice for batch norm.
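To make the batch norm point concrete: with data parallelism (e.g. PyTorch's `nn.DataParallel`, which splits the global batch across replicas and lets each replica compute BatchNorm statistics on its own slice, unless you use `SyncBatchNorm`), the per-device batch mean becomes a much noisier estimate as the slice shrinks. A small NumPy sketch (illustrative only, the distribution and sizes are made up) comparing the noise of the batch-mean estimate for 2 vs. 4 samples per device:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake "activations" of one channel, standard normal for illustration
activations = rng.normal(loc=0.0, scale=1.0, size=100_000)

def batch_mean_noise(per_device_batch, trials=20_000):
    # Std-dev of the per-batch mean that a BatchNorm layer would compute
    # on one device; smaller batches -> noisier statistics (~ 1/sqrt(n))
    samples = rng.choice(activations, size=(trials, per_device_batch))
    return samples.mean(axis=1).std()

noise_2 = batch_mean_noise(2)  # 3 GPUs, global batch 6 -> 2 images per GPU
noise_4 = batch_mean_noise(4)  # 1 GPU,  batch 4       -> 4 images per GPU
print(noise_2, noise_4)
```

With 2 images per device the mean estimate is noticeably noisier than with 4, which is one plausible reason the 3-GPU run could end up worse despite the larger global batch.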

Thanks for the reply.

Actually, I train both of them using 5e-4 (0-80k), 5e-5 (80k-100k), 5e-6 (100k-120k), so the optimisation steps should be the same for both of them?

Or do you mean I should add more steps for the 3-GPU training?

Hmm, not sure what you mean by the 5e-4 (0-80k), but I'm assuming you mean the learning rate for different optimization step ranges.

I'm saying that you should keep the number of steps the same, which isn't the case if you run your training by number of epochs.

Consider a dataset that has 100 images and 50 epochs.

If we have a batch size of 10, that dataset divides into 10 batches -> 10 steps per epoch = 10*50 = 500 total steps.

If we set the batch size to 20, that dataset divides into 5 batches -> 5 steps per epoch = 5*50 = 250 total steps.
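The arithmetic above can be sketched in a few lines (using the same made-up numbers, 100 images and 50 epochs):

```python
# Steps-per-epoch arithmetic for epoch-based training:
# doubling the batch size halves the total number of optimization steps.
dataset_size = 100
epochs = 50

def total_steps(batch_size):
    steps_per_epoch = dataset_size // batch_size  # drop-last behaviour
    return steps_per_epoch * epochs

print(total_steps(10))  # -> 500
print(total_steps(20))  # -> 250
```

So if training is defined by epochs rather than by a fixed step count, the larger-batch run simply does fewer optimizer updates.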

Sorry about my confusing explanation.
This is my learning rate and training steps’ setting:

0-80k steps, learning rate = 5e-4;
80k-100k steps, learning rate = 5e-5;
100k-120k steps, learning rate = 5e-6;

So the total training steps are the same for both of them.

Ok cool. Then that is already a fair comparison. Do you by any chance use BatchNorm or any other normalization layer that depends on a large batch size? Otherwise I don't know what could be "wrong".

With fixed training steps (not epochs), intuitively, training the model with batch size = 6 on 3 GPUs should get a better result (without batch normalisation).
However, from the results, training with batch size = 4 on a single GPU gets a better result.

That is the part I am confused about.

Yup, that confuses me as well. Hopefully someone else might be able to help