Hi,

I'm trying to convert an MXNet network that includes a BatchNorm operation to a PyTorch implementation.

I've got to the stage where the forward pass matches (error on the order of 10^-5), but the backward pass seems to give different results.

Looking at the PyTorch vs. MXNet C implementations, I noticed that in PyTorch the weight and bias terms are computed as:

```
if (gradWeight) {
  real val = THTensor_(get1d)(gradWeight, f);
  THTensor_(set1d)(gradWeight, f, val + scale * dotp * invstd);
}
if (gradBias) {
  real val = THTensor_(get1d)(gradBias, f);
  THTensor_(set1d)(gradBias, f, val + scale * sum);
}
```

Does someone know why gradBias and gradWeight are accumulated into the existing values, i.e.

gradBias = gradBias + scale * sum

As far as I understand, the gradients should simply be

gradBias = scale * sum

and

gradWeight = scale * dotp * invstd
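For reference, here is a small NumPy sketch of the quantities involved, using the same names (`sum`, `dotp`, `invstd`, `scale`) as the C code above. The toy inputs are my own; the final loop just illustrates that the C code adds into the existing buffer (`val + ...`) rather than overwriting it, which is consistent with PyTorch's convention of accumulating gradients across backward passes:

```python
import numpy as np

# Toy data: one BatchNorm channel over a batch of 8.
np.random.seed(0)
x = np.random.randn(8)          # inputs for one channel
grad_out = np.random.randn(8)   # upstream gradients for that channel
scale = 1.0                     # 'scale' in the C code

mean = x.mean()
var = x.var()                   # biased variance, as BatchNorm uses
invstd = 1.0 / np.sqrt(var + 1e-5)
x_hat = (x - mean) * invstd     # normalized input

# The two reductions that appear in the C code:
sum_ = grad_out.sum()                 # 'sum'
dotp = np.dot(grad_out, x - mean)     # 'dotp'

# Per-pass analytic gradients (what I'd expect, per the formulas above):
grad_weight = scale * dotp * invstd   # == sum(grad_out * x_hat)
grad_bias = scale * sum_              # == sum(grad_out)

# The kernel does 'buffer = buffer + new' instead of 'buffer = new',
# so running backward twice leaves twice the per-pass gradient:
grad_weight_buf = 0.0
grad_bias_buf = 0.0
for _ in range(2):
    grad_weight_buf += scale * dotp * invstd
    grad_bias_buf += scale * sum_
```

So the per-pass values agree with `gradWeight = scale * dotp * invstd` and `gradBias = scale * sum`; the only difference is the accumulation into the pre-existing buffer.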