Mini-batch size and scaling

Hi guys,

I have a simple and basic question regarding the calculation of gradients: let's say I have a mini-batch of N elements, all leading to N individual losses, which I want to accumulate into a single loss.

My question: does autograd average the gradients coming from a mini-batch (i.e. divide by the batch size), or does it just sum all the gradients up?

It seems that there is no averaging, just pure summing, but I'd like to get a confirmation.

Here is what I mean: consider a mini-batch of N equal elements. If I sum all the N individual losses, let autograd calculate the gradients, and do an optimizer step (I tried this with plain SGD), then the gradient step scales with N (so N=2 leads to a step twice as big as N=1). Hence, to get something independent of the mini-batch size, I should use the average of the individual losses instead, right?
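A minimal toy sketch of the experiment I mean (the grad_for_batch helper and the trivial linear model are just made up for illustration):

```python
import torch

def grad_for_batch(N, reduce_fn):
    # One data point duplicated N times, so every element of the batch is identical.
    w = torch.tensor([1.0], requires_grad=True)
    x = torch.ones(N, 1)           # N equal inputs
    y = torch.zeros(N, 1)          # N equal targets
    pred = x * w                   # trivial linear model
    losses = (pred - y) ** 2       # N individual losses
    loss = reduce_fn(losses)       # collapse to a single loss via sum or mean
    loss.backward()
    return w.grad.item()

print(grad_for_batch(1, torch.sum))   # 2.0
print(grad_for_batch(2, torch.sum))   # 4.0  -> gradient scales with N
print(grad_for_batch(1, torch.mean))  # 2.0
print(grad_for_batch(2, torch.mean))  # 2.0  -> independent of N
```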

Thanks a lot!

The loss is averaged by default when using a criterion such as nn.CrossEntropyLoss or nn.MSELoss. This makes the gradients independent of the batch size. You could change this behavior by setting the reduction argument to e.g. 'sum'.
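For example, a quick check with nn.MSELoss (the shapes and values here are just chosen for illustration):

```python
import torch
import torch.nn as nn

pred = torch.ones(4, 1, requires_grad=True)   # batch of 4 identical predictions
target = torch.zeros(4, 1)

# Default reduction='mean': the gradient does not grow with the batch size.
loss_mean = nn.MSELoss()(pred, target)
grad_mean = torch.autograd.grad(loss_mean, pred)[0]

# reduction='sum': per-sample gradients are summed, so they scale with the batch size.
loss_sum = nn.MSELoss(reduction='sum')(pred, target)
grad_sum = torch.autograd.grad(loss_sum, pred)[0]

print(grad_mean[0], grad_sum[0])  # tensor([0.5000]) vs tensor([2.])
```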

Great - that is what I needed to know. Thank you!