What is the running mean of BatchNorm if gradients are accumulated?

Due to limited GPU memory, I want to accumulate gradients over several iterations and then backpropagate, so that it works like a large batch. For example, the batch size is now 2, and I call backward after five iterations, which should be the same as a batch of 10. However, what is the running mean of the BN layer in this process? Will PyTorch average all 10 samples, or only take the average of the last mini-batch (2 in this case) as the running mean?


Hi, @Zhang_Chi

Batch Normalization updates its running mean and variance on every call of the forward method.
Also, by default, BatchNorm updates its running mean as running_mean = alpha * mean + (1 - alpha) * running_mean, where alpha is the momentum argument (0.1 by default; the details are here).
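A minimal sketch checking this behavior numerically: a single training-mode forward pass through `nn.BatchNorm1d` updates `running_mean` by the exponential moving average above, with the default momentum of 0.1 (no backward call needed).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)   # running_mean starts at zeros, momentum defaults to 0.1
x = torch.randn(4, 3)    # one mini-batch of size 4

bn.train()
bn(x)                    # forward pass in training mode updates the running stats

# expected: running_mean = momentum * batch_mean + (1 - momentum) * old_running_mean
expected = 0.1 * x.mean(dim=0) + 0.9 * torch.zeros(3)
print(torch.allclose(bn.running_mean, expected, atol=1e-6))  # True
```

So with a batch size of 2 and five iterations, the running mean is updated five times, once per forward call, each time from that iteration's mini-batch of 2; it is not recomputed over all 10 samples at once.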

As to accumulating gradients, this thread “How to implement accumulated gradient?” might help you.

As a side note, I don’t think the accumulated gradients and the large-batch gradient will be the same in your example.


Thank you very much for your reply, @crcrpar.
1. So you mean the running mean is updated during the forward pass, not the backward pass?
2. Why do you think they are different? Do you mean that the accumulated gradients should be divided by 10? Any other difference?

  1. Yes.
  2. The accumulated gradients will only be the same if you divide them by the number of iterations, which is what I was referring to.

Yeah. Is there a quick way to do that? What I can think of is to loop over all the parameters’ gradients and divide each of them by the number of iterations. Or is it equivalent to divide the loss to be backpropagated by the number of iterations?

Yes, it’ll be the same: gradients are linear in the loss, so scaling each per-iteration loss by 1/N before backward gives the same accumulated gradients as scaling the summed gradients afterwards.
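A minimal sketch checking this, using a hypothetical toy `nn.Linear` model (no BatchNorm, so the equivalence is exact): accumulating gradients over 5 mini-batches of size 2, with each loss divided by 5, matches one backward pass over the full batch of 10.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x, y = torch.randn(10, 4), torch.randn(10, 1)

def grad_full(model):
    # single backward pass over the full batch of 10
    model.zero_grad()
    F.mse_loss(model(x), y).backward()
    return [p.grad.clone() for p in model.parameters()]

def grad_accum(model, steps=5):
    # 5 backward passes over mini-batches of 2, each loss scaled by 1/steps
    model.zero_grad()
    for xb, yb in zip(x.chunk(steps), y.chunk(steps)):
        (F.mse_loss(model(xb), yb) / steps).backward()
    return [p.grad.clone() for p in model.parameters()]

m1 = nn.Linear(4, 1)
m2 = nn.Linear(4, 1)
m2.load_state_dict(m1.state_dict())   # identical starting weights

for g1, g2 in zip(grad_full(m1), grad_accum(m2)):
    print(torch.allclose(g1, g2, atol=1e-6))  # True, True
```

With BatchNorm layers in the model the two would not match exactly, since each small forward pass normalizes by its own mini-batch statistics, which is the caveat raised earlier in this thread.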

thank you so much @crcrpar

Glad to hear that!:grin:

Hi, why is the running mean calculated as:

running_mean = alpha * mean + (1 - alpha) * running_mean

rather than simply:

running_mean = sum(mean of every batch)/batch counts

What is the difference, and what is the benefit of running_mean = alpha * mean + (1 - alpha) * running_mean?
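One benefit of the exponential moving average is that it weights recent batches more heavily, which matters because activation statistics drift as the network's weights change during training: a cumulative average stays anchored to stale early-training statistics. A minimal sketch contrasting the two on a stream whose mean shifts (the specific values are illustrative, not from the thread):

```python
alpha = 0.1
ema, cum, n = 0.0, 0.0, 0

# early batches have mean ~0, later batches mean ~5 (distribution shift)
batch_means = [0.0] * 50 + [5.0] * 50
for m in batch_means:
    ema = alpha * m + (1 - alpha) * ema   # exponential moving average
    n += 1
    cum += (m - cum) / n                  # incremental simple (cumulative) average

print(ema)   # close to 5.0: EMA tracks the recent statistics
print(cum)   # 2.5: cumulative average is still dominated by the stale early batches
```

The EMA also needs no batch counter and keeps a fixed, simple update rule. That said, PyTorch does support the cumulative behavior too: passing momentum=None to a BatchNorm layer makes it use a cumulative moving average instead of the EMA.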