PyTorch BatchNorm vs. MXNet BatchNorm


I'm trying to convert an MXNet network that includes a BatchNorm operation to a PyTorch implementation.
I've reached the stage where the forward passes agree (to within ~10^-5 error), but the backward pass seems to give different results.

Taking a look at the PyTorch vs. MXNet C implementations, I noticed that the weight and bias gradient terms
in PyTorch are computed as:

if (gradWeight) {
  real val = THTensor_(get1d)(gradWeight, f);
  THTensor_(set1d)(gradWeight, f, val + scale * dotp * invstd);
}

if (gradBias) {
  real val = THTensor_(get1d)(gradBias, f);
  THTensor_(set1d)(gradBias, f, val + scale * sum);
}

Does anyone know why gradBias and gradWeight are summed with their previous values in the gradient computation, i.e.

gradBias = gradBias + scale * sum

As far as I understand, the gradient should just be

gradBias = scale * sum


gradWeight = scale * dotp * invstd
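For reference, here is a small pure-Python sketch of the per-channel reduction terms from the C snippet (sum, dotp, invstd) for a single channel with made-up example values, showing what gradBias and gradWeight look like without the accumulation (scale is taken as 1 here):

```python
import math

eps = 1e-5
x = [1.0, 2.0, 3.0, 4.0]          # activations in one channel (example values)
grad_out = [1.0, 1.0, 1.0, 1.0]   # upstream gradient for that channel

mean = sum(x) / len(x)
var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance, as in batch norm
invstd = 1.0 / math.sqrt(var + eps)

s = sum(grad_out)                                        # "sum" in the C code
dotp = sum(g * (v - mean) for g, v in zip(grad_out, x))  # "dotp" in the C code

grad_bias = s                  # gradBias = scale * sum, with scale == 1
grad_weight = dotp * invstd    # gradWeight = scale * dotp * invstd

print(grad_bias, grad_weight)  # 4.0 and 0.0 for this symmetric example
```

Since gradOutput is constant here and x is symmetric around its mean, dotp is exactly zero, so gradWeight vanishes while gradBias is just the sum of the upstream gradients.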

In PyTorch, if you want just the gradient of a single backward pass, you should call zero_grad() before calling backward(). In that case the two expressions are the same, since the previous gradBias and gradWeight are 0.
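The accumulation behaviour is easy to check directly. A minimal sketch (assuming torch is installed; the layer size and input are arbitrary):

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(3)
x = torch.randn(8, 3)

# First backward: .grad receives the gradient of this pass.
bn(x).sum().backward()
g1 = bn.bias.grad.clone()

# Second backward without zeroing: the new gradient is ADDED to .grad,
# matching the "val + scale * sum" accumulation in the C code.
bn(x).sum().backward()
assert torch.allclose(bn.bias.grad, 2 * g1)

# Zeroing the gradients first gives the plain single-pass gradient again.
bn.zero_grad()
bn(x).sum().backward()
assert torch.allclose(bn.bias.grad, g1)
```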


Have you found out how to make BatchNorm in PyTorch and MXNet the same? My test shows that the norm of the difference between the outputs of a PyTorch BatchNorm layer and an MXNet BatchNorm layer is about 0.2. Is that normal? Is it possible to make them match?
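A gap of that size usually comes from comparing layers in different modes or with different settings (train vs. eval, eps, momentum/running statistics) rather than from the normalization math itself. A minimal PyTorch-only sketch showing that the eval-mode formula is reproducible by hand, so those settings are what need to be aligned between the two frameworks:

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(3, eps=1e-5)
bn.eval()  # eval mode normalizes with running_mean / running_var
x = torch.randn(8, 3)

# Hand-written eval-mode formula: matches the layer to high precision,
# so a ~0.2 difference points at a mode/eps/running-stats mismatch.
y_manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps) \
           * bn.weight + bn.bias
assert torch.allclose(bn(x), y_manual, atol=1e-6)

bn.train()  # training mode uses the current batch's statistics instead
assert not torch.allclose(bn(x), y_manual, atol=1e-3)
```

When comparing against MXNet, make sure both layers use the same mode, the same eps, and the same running statistics (and the same momentum convention if stats are being updated).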