How does PyTorch count batch size in SGD with multiple GPUs when it does Batch Norm?

When PyTorch does batch normalization in SGD, what batch size does it use? The per-GPU batch size, or the total batch size over all GPUs?
For example, suppose I set the per-GPU batch size to 32 and use 8 GPUs. When PyTorch does batch normalization, what batch size does it use, 32 or 32 x 8?
My concern comes from Facebook’s paper “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”. If PyTorch’s implementation is the same as what this paper describes, I can modify the parameters of my model accordingly; if not, I need to know. Thanks!

I am pretty sure that if you, e.g., set the minibatch size to 32 in the DataLoader and have, say, 8 GPUs, you get 4 data points per GPU with DataParallel (you can see this from the number of training instances per minibatch and from the memory use). Are you referring to DataParallel in your question?
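
To make the arithmetic concrete, here is a minimal sketch in plain Python (no PyTorch needed). It assumes the batch is cut into chunks of `ceil(batch_size / num_devices)` along the batch dimension; `split_sizes` is a hypothetical helper written for this post, not a PyTorch function.

```python
import math
from typing import List


def split_sizes(batch_size: int, num_devices: int) -> List[int]:
    """Hypothetical helper: per-device slice sizes for a chunk-style split.

    Assumption: each device gets ceil(batch_size / num_devices) samples,
    except that the last slice may be smaller (or missing) if the batch
    does not divide evenly.
    """
    chunk = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


# DataLoader batch size 32 scattered over 8 GPUs -> 4 samples per GPU.
print(split_sizes(32, 8))
# To get 32 samples on each of 8 GPUs, the DataLoader batch must be 256.
print(split_sizes(256, 8))
```

So under this assumption, the number passed to the DataLoader is the total batch, and each replica only sees its own slice of it.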

In practice, I found that yes, using e.g., 4 GPUs can maybe speed up computation by ~2.5 times, but then I also need to train 1.5 more epochs to get comparable results (compared to 1 GPU).

Also, I usually use minibatch size * number of GPUs as my minibatch param to make the best use of the GPUs. I.e., if you have a model whose current batch size fills up 80% of the GPU memory and you want to maintain that level of utilization, you’d need to increase the batch size accordingly, since the batch gets scattered across the GPUs.

Thanks for your detailed explanation.

What does the minibatch param mean? :thinking: Is it the number of epochs to train the model?

Thank you.

Sorry, that was maybe a bit confusing. By minibatch param, I meant the minibatch size. E.g., if my minibatch size is 256 for 1 GPU, I use 4*256 for four GPUs.
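
As a rough sketch of that scaling in plain Python: the helper names below are made up for illustration. The learning-rate part is not something stated in this thread; it is the linear scaling rule from the paper cited in the original question (learning rate 0.1 at a minibatch of 256, scaled in proportion to the batch size), so treat it as an assumption to verify for your own model.

```python
def total_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Hypothetical helper: the batch size to hand to the DataLoader."""
    return per_gpu_batch * num_gpus


def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule from the cited paper: lr grows with the batch size."""
    return base_lr * new_batch / base_batch


# 256 samples per GPU on 4 GPUs, as in the example above.
batch = total_batch_size(256, 4)
# Assumed reference point from the paper: lr = 0.1 at batch size 256.
lr = linearly_scaled_lr(0.1, 256, batch)
print(batch, lr)
```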

What I’m asking is which batch size is used in batch normalization when there is more than one GPU. Does PyTorch do batch normalization per GPU, or does it do batch normalization across all GPUs?

For example, let’s assume the batch size per GPU is 32 and there are 4 GPUs, so the total batch size is 32 * 4 = 128. When PyTorch does batch normalization, which ‘batch’ does it use? 32 or 128?
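
To show why the distinction matters, here is a small plain-Python sketch (not PyTorch code) that normalizes one feature either per device slice or over the whole batch. The toy data (8 samples on 2 devices instead of 128 on 4) and the `batch_norm` helper are made up for illustration. As far as I know, the regular `nn.BatchNorm*` layers compute their statistics on whatever slice each replica receives, and `torch.nn.SyncBatchNorm` is the layer intended for statistics shared across processes with DistributedDataParallel, but please check the docs for your version.

```python
from statistics import fmean, pvariance


def batch_norm(values, eps=1e-5):
    """Normalize one feature with the batch mean and biased variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Toy activations for a single feature: 8 samples, 4 per device.
batch = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
num_devices = 2
per_device = len(batch) // num_devices
chunks = [batch[i:i + per_device] for i in range(0, len(batch), per_device)]

# Per-device statistics: each slice is normalized on its own.
per_device_means = [fmean(c) for c in chunks]
per_device_out = [y for c in chunks for y in batch_norm(c)]

# Whole-batch statistics: what a synchronized batch norm would use.
global_mean = fmean(batch)
global_out = batch_norm(batch)

print(per_device_means, global_mean)
print(per_device_out[0], global_out[0])
```

In this toy case each slice is centered on its own mean, so the two slices end up with identical normalized values, while whole-batch normalization keeps them apart. The two choices give different outputs whenever the slices have different statistics.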

Yeah, this is also what I found. So you mean we cannot get a linear speedup with the current PyTorch?