Hi,

Due to limited GPU memory, I want to accumulate gradients over several iterations and then backpropagate, so that it works like a large batch. For example, the batch size is now 2 and I backpropagate after five iterations, which should be the same as a batch of 10. However, what is the running mean of the BN layer in this process? Will PyTorch average over all 10 samples, or only take the average of the last mini-batch (2 in this case) as the running mean?

Hi, @Zhang_Chi

Batch Normalization updates its running mean and variance on every call of the `forward` method.

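Also, by default, BatchNorm updates its running mean by `running_mean = alpha * mean + (1 - alpha) * running_mean`, where `mean` is the mean of the current mini-batch and `alpha` is the layer's `momentum` (0.1 by default) (the details are here).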
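
To see the update rule in action, here is a minimal sketch. It assumes a standalone `nn.BatchNorm1d` layer in training mode with the default `momentum`, fed five random mini-batches of size 2; the layer size and the data are made up for illustration. It replays the formula by hand and compares the result with the layer's `running_mean`, without ever calling `backward()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)      # hypothetical layer with 3 features; momentum defaults to 0.1
bn.train()                  # running statistics are only updated in training mode

alpha = bn.momentum
expected = torch.zeros(3)   # running_mean is initialised to zeros

# Five mini-batches of size 2; note that backward() is never called.
for _ in range(5):
    x = torch.randn(2, 3)
    bn(x)                                   # forward pass updates the running stats
    batch_mean = x.mean(dim=0)
    expected = alpha * batch_mean + (1 - alpha) * expected

# The hand-computed value should agree with what the layer stored.
matches = torch.allclose(bn.running_mean, expected, atol=1e-6)
print(matches)
```

So with gradient accumulation the running mean is neither the mean of all 10 samples nor just the mean of the last mini-batch: each of the five forward passes blends its own mini-batch mean into the running value.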

As to accumulating gradients, this thread “How to implement accumulated gradient？” might help you.

As a side note, I don’t think the accumulated gradients and the gradient of a single batch of 10 will be the same in your example.

Thank you very much for your reply, @crcrpar.

1. So you mean the running mean is updated during the forward pass, not the backward pass?

2. Why do you think they are different? Do you mean that the accumulated gradients should be divided by 10? Is there any other difference?

- Yes.
- Accumulated gradients will be the same if you divide them by the number of iterations. I was referring to the post below.

Yeah. Is there a quick way to do that? What I can think of is to loop over all the parameters’ gradients and divide each of them by the number of iterations. Or is it the same if I divide the loss to be backpropagated by the number of iterations?

Yes, it’ll be the same.
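
For reference, a small sketch of the "divide the loss" approach. It assumes a toy `nn.Linear` model with a mean-reduced `nn.MSELoss` and random data (all hypothetical, chosen only to keep the check short), and compares the gradient from one batch of 10 with the gradient accumulated over five mini-batches of 2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 1)     # toy model without BatchNorm
loss_fn = nn.MSELoss()      # default reduction="mean"
x = torch.randn(10, 4)
y = torch.randn(10, 1)

# Reference: a single backward pass over the full batch of 10.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: five mini-batches of 2, each loss divided by the number of steps.
accum_steps = 5
model.zero_grad()
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()         # each call adds into .grad
accum_grad = model.weight.grad.clone()

same = torch.allclose(full_grad, accum_grad, atol=1e-6)
print(same)
```

Note that this toy model has no BatchNorm layer. With BatchNorm in training mode, each mini-batch of 2 is normalised with its own statistics, so the accumulated gradient would not match the batch-of-10 gradient exactly.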

Glad to hear that!

@crcrpar

Hi, why is `running_mean` calculated as:

```
running_mean = alpha * mean + (1 - alpha) * running_mean
```

and not simply as:

```
running_mean = sum(batch_means) / num_batches
```

What is the difference, or what is the benefit of `running_mean = alpha * mean + (1 - alpha) * running_mean`?
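
To make the comparison concrete, here is a rough sketch in plain Python. The sequence of batch means is made up (the statistics shift halfway through, as they might while the weights are still changing), and `ema_update` is a hypothetical helper, not a PyTorch function. It computes both estimates on the same sequence.

```python
def ema_update(running_mean, batch_mean, alpha=0.1):
    """One exponential-moving-average step, as in the formula above."""
    return alpha * batch_mean + (1 - alpha) * running_mean


# Hypothetical batch means: 50 batches around 0.0, then 50 batches around 1.0.
batch_means = [0.0] * 50 + [1.0] * 50

# Exponential moving average: recent batches dominate, old ones fade out.
ema = 0.0
for m in batch_means:
    ema = ema_update(ema, m)

# Simple average: every batch counts equally, no matter how old it is.
simple = sum(batch_means) / len(batch_means)

print(ema, simple)
```

In this sketch the moving average ends up close to the recent value of 1.0, while the simple average stays at 0.5 because it weights the early batches as heavily as the latest ones. If a simple average is what you want, the PyTorch BatchNorm layers use a cumulative moving average when constructed with `momentum=None`.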