Thank you very much for your reply, @crcrpar.
1. So you mean the running mean is updated during the forward pass, not the backward pass?
2. Why do you think they are different? Do you mean that the accumulated gradients should be divided by 10? Are there any other differences?
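
For question 1, here is a minimal sketch of how this could be checked. It assumes the framework is PyTorch and uses a standalone `nn.BatchNorm1d` layer with its default momentum of 0.1. The layer size, the input shape, and the variable names are made up for illustration and are not taken from the code under discussion.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumption: a standalone BatchNorm layer in training mode (default momentum=0.1).
bn = nn.BatchNorm1d(3)
bn.train()

x = torch.randn(8, 3) + 5.0  # hypothetical batch with a non-zero mean
before = bn.running_mean.clone()

# Forward pass only, with autograd disabled: no backward pass can happen here.
with torch.no_grad():
    bn(x)
after_forward = bn.running_mean.clone()

# Expected update: new = (1 - momentum) * old + momentum * batch_mean
expected = 0.9 * before + 0.1 * x.mean(dim=0)
changed_by_forward = not torch.equal(before, after_forward)
matches_momentum_rule = torch.allclose(after_forward, expected)

# Second forward pass, this time followed by a backward pass.
out = bn(x)
after_second_forward = bn.running_mean.clone()
out.sum().backward()
unchanged_by_backward = torch.equal(after_second_forward, bn.running_mean)

print(changed_by_forward, matches_momentum_rule, unchanged_by_backward)
```

If all three flags are true, the running mean changes on every training-mode forward call and the backward call leaves it untouched, at least under the assumptions above.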