Hi,
Due to limited GPU memory, I want to accumulate gradients over several iterations and only then backpropagate, so that it behaves like a large batch. For example, the batch size is currently 2, and I call backward and step after five iterations, which should be equivalent to a batch of 10. However, what happens to the running mean of the BN layers in this process? Will PyTorch average over all 10 samples, or only take the mean of the last mini-batch (2 in this case) as the running mean?
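For context, a minimal sketch of the accumulation loop I have in mind (the model, data, and loss here are just placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical model and data, just to illustrate the loop.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

accum_steps = 5  # call optimizer.step() every 5 iterations
micro_batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(10)]

w_before = model.weight.detach().clone()
optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = criterion(model(x), y) / accum_steps  # scale so grads average out
    loss.backward()                              # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```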
Hi, @Zhang_Chi
Batch Normalization updates its running mean and variance on every call of the forward method.
Also, by default, BatchNorm updates its running mean as running_mean = alpha * mean + (1 - alpha) * running_mean, where alpha is the momentum (0.1 by default; the details are here).
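You can check both points directly, that the running statistics move on forward (not backward), and that they follow the update rule above with the default momentum of 0.1:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, momentum=0.1)   # running_mean starts at zeros
x = torch.randn(8, 3)

bn.train()
before = bn.running_mean.clone()
out = bn(x)                            # the forward call updates running_mean

# PyTorch's update: (1 - momentum) * running_mean + momentum * batch_mean
expected = (1 - 0.1) * before + 0.1 * x.mean(dim=0)
print(torch.allclose(bn.running_mean, expected))  # True

snapshot = bn.running_mean.clone()
out.sum().backward()                   # backward does NOT touch running stats
print(torch.equal(bn.running_mean, snapshot))     # True
```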
As to accumulating gradients, this thread “How to implement accumulated gradient? - #8 by Gopal_Sharma” might help you.
As a side note, I don’t think the accumulated gradients and the full-batch gradient will be the same in this example.
Thank you very much for your reply, @crcrpar.
1. So you mean the running mean is updated during the forward pass, not the backward pass?
2. Why do you think they are different? Do you mean the accumulated gradients should be divided by 10? Is there any other difference?
- Yes.
- The accumulated gradients will be the same if you divide them by the number of iterations.
Yeah. Is there a quick way to do that? What I can think of is to loop over all the parameters’ gradients and divide each of them by the number of iterations. Or is it the same if I divide the loss to be backpropagated by the number of iterations?
Yes, it’ll be the same.
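A quick numerical check of this, for a loss with mean reduction (e.g. the default MSELoss): accumulating over five micro-batches of 2, with each micro-loss divided by 5, reproduces the gradient of one batch of 10 (up to floating-point error).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(10, 4)
y = torch.randn(10, 1)

model = nn.Linear(4, 1)
criterion = nn.MSELoss()   # mean reduction by default

# One full batch of 10.
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Five micro-batches of 2, each loss divided by the number of iterations.
model.zero_grad()
for i in range(5):
    xb, yb = x[2 * i:2 * i + 2], y[2 * i:2 * i + 2]
    (criterion(model(xb), yb) / 5).backward()  # gradients accumulate
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))  # True
```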
Thank you so much, @crcrpar.
Glad to hear that!
@crcrpar
Hi, why is running_mean calculated as
running_mean = alpha * mean + (1 - alpha) * running_mean
rather than simply
running_mean = sum(mean of every batch) / batch count
What is the difference, or what is the benefit of running_mean = alpha * mean + (1 - alpha) * running_mean?
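For what it’s worth, PyTorch lets you compare the two rules directly: passing momentum=None to BatchNorm switches it from the exponential moving average to the cumulative (simple) average. A small illustration, using batches with a deliberately shifting mean so the difference is visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batches = [torch.randn(8, 3) + i for i in range(5)]  # shifting distribution

bn_ema = nn.BatchNorm1d(3, momentum=0.1)   # exponential moving average
bn_cma = nn.BatchNorm1d(3, momentum=None)  # cumulative (simple) average
for x in batches:
    bn_ema(x)
    bn_cma(x)

# The cumulative version equals the plain mean of the per-batch means...
batch_means = torch.stack([x.mean(dim=0) for x in batches])
print(torch.allclose(bn_cma.running_mean, batch_means.mean(dim=0), atol=1e-6))  # True

# ...while the EMA weights recent batches more heavily, so it differs here.
print(torch.allclose(bn_ema.running_mean, batch_means.mean(dim=0)))  # False
```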