Hi,
Due to limited GPU memory, I want to accumulate gradients over several iterations and only then backpropagate, so that it behaves like a large batch. For example, the batch size is currently 2, and I call backward and step after five iterations, which should be equivalent to a batch of 10. However, what happens to the running mean of the BN layers in this process? Will PyTorch average over all 10 samples, or only take the mean of the last mini-batch (2 in this case) as the running mean?
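For context, a minimal sketch of the accumulation loop I have in mind (the model, data, and loss here are just placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical model and data, just to illustrate the loop.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

accum_steps = 5  # call optimizer.step() every 5 iterations
micro_batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(10)]

w_before = model.weight.detach().clone()
optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = criterion(model(x), y) / accum_steps  # scale so grads average out
    loss.backward()                              # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```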
Hi, @Zhang_Chi
Batch Normalization updates its running mean and variance on every call of the forward method.
Also, by default, BatchNorm updates its running mean as running_mean = alpha * mean + (1 - alpha) * running_mean, where alpha is the momentum (0.1 by default; the details are here).
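You can check both points directly, that the running statistics move on forward (not backward), and that they follow the update rule above with the default momentum of 0.1:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, momentum=0.1)   # running_mean starts at zeros
x = torch.randn(8, 3)

bn.train()
before = bn.running_mean.clone()
out = bn(x)                            # the forward call updates running_mean

# PyTorch's update: (1 - momentum) * running_mean + momentum * batch_mean
expected = (1 - 0.1) * before + 0.1 * x.mean(dim=0)
print(torch.allclose(bn.running_mean, expected))  # True

snapshot = bn.running_mean.clone()
out.sum().backward()                   # backward does NOT touch running stats
print(torch.equal(bn.running_mean, snapshot))     # True
```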
As to accumulating gradients, this thread “How to implement accumulated gradient? - #8 by Gopal_Sharma” might help you.
As a side note, I don’t think the accumulated gradients and the full-batch gradient will be the same in this example.
Thank you very much for your reply, @crcrpar.
1. So you mean the running mean is updated during the forward pass, not the backward pass?
2. Why do you think they are different? Do you mean the accumulated gradients should be divided by 10? Is there any other difference?
- Yes.
- The accumulated gradients will be the same if you divide them by the number of iterations.
Yeah. Is there a quick way to do that? What I can think of is to loop over all the parameters’ gradients and divide each of them by the number of iterations. Or is it the same if I divide the loss to be backpropagated by the number of iterations?
Yes, it’ll be the same.
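A quick numerical check of this, for a loss with mean reduction (e.g. the default MSELoss): accumulating over five micro-batches of 2, with each micro-loss divided by 5, reproduces the gradient of one batch of 10 (up to floating-point error).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(10, 4)
y = torch.randn(10, 1)

model = nn.Linear(4, 1)
criterion = nn.MSELoss()   # mean reduction by default

# One full batch of 10.
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Five micro-batches of 2, each loss divided by the number of iterations.
model.zero_grad()
for i in range(5):
    xb, yb = x[2 * i:2 * i + 2], y[2 * i:2 * i + 2]
    (criterion(model(xb), yb) / 5).backward()  # gradients accumulate
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))  # True
```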
Thank you so much, @crcrpar.
Glad to hear that!
@crcrpar
Hi, why is running_mean calculated as
running_mean = alpha * mean + (1 - alpha) * running_mean
rather than simply
running_mean = sum(mean of every batch) / batch count
What is the difference, or what is the benefit of running_mean = alpha * mean + (1 - alpha) * running_mean?
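For what it’s worth, PyTorch lets you compare the two rules directly: passing momentum=None to BatchNorm switches it from the exponential moving average to the cumulative (simple) average. A small illustration, using batches with a deliberately shifting mean so the difference is visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batches = [torch.randn(8, 3) + i for i in range(5)]  # shifting distribution

bn_ema = nn.BatchNorm1d(3, momentum=0.1)   # exponential moving average
bn_cma = nn.BatchNorm1d(3, momentum=None)  # cumulative (simple) average
for x in batches:
    bn_ema(x)
    bn_cma(x)

# The cumulative version equals the plain mean of the per-batch means...
batch_means = torch.stack([x.mean(dim=0) for x in batches])
print(torch.allclose(bn_cma.running_mean, batch_means.mean(dim=0), atol=1e-6))  # True

# ...while the EMA weights recent batches more heavily, so it differs here.
print(torch.allclose(bn_ema.running_mean, batch_means.mean(dim=0)))  # False
```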