Does batch size = 6 on 3 GPUs give a worse result than a single GPU with batch size = 4?


I am trying to train an SSD (VGG-based) for facial detection.
Then I noticed a problem:
when I train the model on a single GPU with batch size = 4, I get a better result than training on 3 GPUs with batch size = 6.

The learning rate settings and the training iterations are exactly the same in both runs.

However, I expected the 3-GPU run to get a better result because of the larger batch size.
Can someone help me to find out the reason?

I believe you are right, a higher batch size should give a better result. However, a higher batch size also decreases the number of optimization steps per epoch.

So try increasing the number of optimization steps / epochs by 50% and tell us how the results fare then.

Or do you possibly have batch norm layers in your model? If you do, a batch size of 6 on 3 GPUs means each GPU gets only 2 images -> not so nice for batch norm.
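To make the batch norm point concrete: with data parallelism (e.g. PyTorch's `nn.DataParallel`, which splits the global batch across replicas and lets each replica compute BatchNorm statistics on its own slice, unless you use `SyncBatchNorm`), the per-device batch mean becomes a much noisier estimate as the slice shrinks. A small NumPy sketch (illustrative only, the distribution and sizes are made up) comparing the noise of the batch-mean estimate for 2 vs. 4 samples per device:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake "activations" of one channel, standard normal for illustration
activations = rng.normal(loc=0.0, scale=1.0, size=100_000)

def batch_mean_noise(per_device_batch, trials=20_000):
    # Std-dev of the per-batch mean that a BatchNorm layer would compute
    # on one device; smaller batches -> noisier statistics (~ 1/sqrt(n))
    samples = rng.choice(activations, size=(trials, per_device_batch))
    return samples.mean(axis=1).std()

noise_2 = batch_mean_noise(2)  # 3 GPUs, global batch 6 -> 2 images per GPU
noise_4 = batch_mean_noise(4)  # 1 GPU,  batch 4       -> 4 images per GPU
print(noise_2, noise_4)
```

With 2 images per device the mean estimate is noticeably noisier than with 4, which is one plausible reason the 3-GPU run could end up worse despite the larger global batch.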

Thanks for the reply.

Actually, I train both of them using 5e-4 (0-80k), 5e-5 (80k-100k), 5e-6 (100k-120k), so the optimisation steps should be the same for both of them?

Or do you mean I should add more steps for the 3-GPU training?

Hmm, not sure what you mean by the 5e-4 (0-80k), but I'm assuming you mean the learning rate for different optimization step ranges.

I'm saying that you should keep the number of steps the same, which isn't the case if you run your training by number of epochs.

Consider a dataset that has 100 images and 50 epochs.

If we have a batch size of 10, that dataset divides into 10 batches -> 10 steps per epoch = 10*50 = 500 total steps.

If we set the batch size to 20, that dataset divides into 5 batches -> 5 steps per epoch = 5*50 = 250 total steps.
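The arithmetic above can be sketched in a few lines (using the same made-up numbers, 100 images and 50 epochs):

```python
# Steps-per-epoch arithmetic for epoch-based training:
# doubling the batch size halves the total number of optimization steps.
dataset_size = 100
epochs = 50

def total_steps(batch_size):
    steps_per_epoch = dataset_size // batch_size  # drop-last behaviour
    return steps_per_epoch * epochs

print(total_steps(10))  # -> 500
print(total_steps(20))  # -> 250
```

So if training is defined by epochs rather than by a fixed step count, the larger-batch run simply does fewer optimizer updates.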

Sorry about my confusing explanation.
This is my learning rate and training steps’ setting:

0-80k steps, learning rate = 5e-4;
80k-100k steps, learning rate = 5e-5;
100k-120k steps, learning rate = 5e-6;

So the total training steps are the same for both of them.

Ok cool. Then that is already a fair comparison. Do you by any chance use BatchNorm or any other normalization layer that depends on a large batch size? Otherwise I don't know what could be "wrong".

With fixed training steps (not epochs), intuitively, training the model with batch size = 6 on 3 GPUs should get a better result (without batch normalisation).
However, from the results, training with batch size = 4 on a single GPU gets a better result.

That is the part I am confused about.

Yup, that confuses me as well. Hopefully someone else might be able to help