Hi,

I'm trying to convert an MXNet network that includes a BatchNorm operation to a PyTorch implementation.

I've got to the stage where the forward pass matches (error on the order of 10^-5), but the backward pass seems to give different results.

Looking at the PyTorch vs. MXNet C implementations, I noticed that in PyTorch the weight and bias terms are computed as:

```
if (gradWeight) {
  real val = THTensor_(get1d)(gradWeight, f);
  THTensor_(set1d)(gradWeight, f, val + scale * dotp * invstd);
}
if (gradBias) {
  real val = THTensor_(get1d)(gradBias, f);
  THTensor_(set1d)(gradBias, f, val + scale * sum);
}
```

Does someone know why gradBias and gradWeight are accumulated into the existing values, i.e.

gradBias = gradBias + scale * sum

As far as I understand, the gradients should simply be

gradBias = scale * sum

and

gradWeight = scale * dotp * invstd
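For reference, here is a small NumPy sketch of the quantities involved, using the same names (`sum`, `dotp`, `invstd`, `scale`) as the C code above. The toy inputs are my own; the final loop just illustrates that the C code adds into the existing buffer (`val + ...`) rather than overwriting it, which is consistent with PyTorch's convention of accumulating gradients across backward passes:

```python
import numpy as np

# Toy data: one BatchNorm channel over a batch of 8.
np.random.seed(0)
x = np.random.randn(8)          # inputs for one channel
grad_out = np.random.randn(8)   # upstream gradients for that channel
scale = 1.0                     # 'scale' in the C code

mean = x.mean()
var = x.var()                   # biased variance, as BatchNorm uses
invstd = 1.0 / np.sqrt(var + 1e-5)
x_hat = (x - mean) * invstd     # normalized input

# The two reductions that appear in the C code:
sum_ = grad_out.sum()                 # 'sum'
dotp = np.dot(grad_out, x - mean)     # 'dotp'

# Per-pass analytic gradients (what I'd expect, per the formulas above):
grad_weight = scale * dotp * invstd   # == sum(grad_out * x_hat)
grad_bias = scale * sum_              # == sum(grad_out)

# The kernel does 'buffer = buffer + new' instead of 'buffer = new',
# so running backward twice leaves twice the per-pass gradient:
grad_weight_buf = 0.0
grad_bias_buf = 0.0
for _ in range(2):
    grad_weight_buf += scale * dotp * invstd
    grad_bias_buf += scale * sum_
```

So the per-pass values agree with `gradWeight = scale * dotp * invstd` and `gradBias = scale * sum`; the only difference is the accumulation into the pre-existing buffer.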