PyTorch BatchNorm vs. MXNet BatchNorm


I'm trying to convert an MXNet network that includes a BatchNorm operation to a PyTorch implementation.
I got to the stage where the forward pass is similar (error on the order of 10^-5), but the backward pass seems to give different results.

Taking a look at the PyTorch vs. MXNet C implementations, I noticed that the weight and bias gradient terms
in PyTorch are computed as

if (gradWeight) {
  real val = THTensor_(get1d)(gradWeight, f);
  THTensor_(set1d)(gradWeight, f, val + scale * dotp * invstd);
}

if (gradBias) {
  real val = THTensor_(get1d)(gradBias, f);
  THTensor_(set1d)(gradBias, f, val + scale * sum);
}

Does anyone know why the previous values of gradBias and gradWeight are added to the newly computed gradient? For example:

gradBias = gradBias + scale * sum

As far as I understand, the gradients should be

gradBias = scale * sum


gradWeight = scale * dotp * invstd
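
The two expected formulas can be checked against autograd. The following is only a minimal sketch, assuming scale = 1, a training-mode BatchNorm2d layer, and that sum and dotp are the per-channel reductions used in the C snippet (sum of the output gradient, and sum of (x - mean) times the output gradient). The tensor shapes and variable names are made up for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical input and upstream gradient, in float64 for a tight comparison.
x = torch.randn(8, 3, 4, 4, dtype=torch.float64)
grad_out = torch.randn(8, 3, 4, 4, dtype=torch.float64)

bn = torch.nn.BatchNorm2d(3).double()
bn.train()  # use batch statistics, as during training

out = bn(x)
out.backward(grad_out)

# Per-channel batch statistics (biased variance, as used for normalization).
dims = (0, 2, 3)
mean = x.mean(dim=dims, keepdim=True)
var = ((x - mean) ** 2).mean(dim=dims, keepdim=True)
invstd = 1.0 / torch.sqrt(var + bn.eps)

# Manual gradients with scale = 1 (an assumption of this sketch).
grad_sum = grad_out.sum(dim=dims)               # "sum" in the C code
dotp = ((x - mean) * grad_out).sum(dim=dims)    # "dotp" in the C code
manual_grad_bias = grad_sum
manual_grad_weight = dotp * invstd.flatten()

bias_matches = torch.allclose(bn.bias.grad, manual_grad_bias)
weight_matches = torch.allclose(bn.weight.grad, manual_grad_weight)
print(bias_matches, weight_matches)
```

Because the layer is freshly created, its gradients start out empty, so a single backward call yields exactly the non-accumulated values.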

In PyTorch, gradients are accumulated across calls to backward(). If you want just one gradient, call zero_grad() before calling backward(). In that case the two expressions are the same, since the previous gradBias and gradWeight are 0.
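
The accumulation behaviour can be seen directly from Python. This is a rough sketch rather than a reference implementation: the layer size, batch size, and variable names are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

bn = torch.nn.BatchNorm1d(4)
bn.train()
x = torch.randn(16, 4)
grad_out = torch.randn(16, 4)  # hypothetical upstream gradient

# First backward pass: gradients start empty, so this is the plain gradient.
bn(x).backward(grad_out)
first_bias = bn.bias.grad.clone()
first_weight = bn.weight.grad.clone()

# Second backward pass WITHOUT zero_grad(): the new gradient is added
# to the stored one, which is the "val + ..." in the C code.
bn(x).backward(grad_out)
accumulated_bias = bn.bias.grad.clone()
accumulated_weight = bn.weight.grad.clone()

# Clearing the gradients first gives the single-pass value again.
bn.zero_grad()
bn(x).backward(grad_out)
fresh_bias = bn.bias.grad.clone()
fresh_weight = bn.weight.grad.clone()

print(torch.allclose(accumulated_bias, 2 * first_bias))
print(torch.allclose(fresh_bias, first_bias))
```

When comparing against another framework's backward pass, clearing the gradients before each backward call keeps the comparison limited to a single pass.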

Have you found out how to make BatchNorm in PyTorch and MXNet give the same results? My test shows that the norm of the difference between the outputs of a PyTorch BatchNorm layer and an MXNet BatchNorm layer is about 0.2. Is that normal? Is it possible to make them the same?