Help understanding BatchNorm

I have a PyTorch model consisting of a Conv2d layer followed by BatchNorm2d, and I am printing the output of each layer in the forward pass.

I cannot seem to understand the output of BatchNorm based on the weight and bias values it holds.

The following are the outputs as printed by PyTorch (the conv output, which is also the input to BatchNorm, followed by the BatchNorm output):

 tensor([[[[-0.0403,  0.0103,  0.0185],
          [ 0.0240,  0.0535,  0.0137],
          [ 0.0233,  0.0239, -0.0202]],

         [[-0.1044, -0.1664, -0.2347],
          [-0.1708, -0.2092, -0.2356],
          [-0.2202, -0.2412, -0.2733]]]], grad_fn=<MkldnnConvolutionBackward>)

 tensor([[[[-1.6799, -0.0496,  0.2127],
          [ 0.3922,  1.3428,  0.0598],
          [ 0.3674,  0.3883, -1.0339]],

         [[ 0.4344,  0.1697, -0.1216],
          [ 0.1510, -0.0127, -0.1253],
          [-0.0596, -0.1495, -0.2863]]]], grad_fn=<NativeBatchNormBackward>)

The outputs were printed from the forward function as:

x1 = self.conv1(x)
print(x1)
x2 = self.bn(x1)        
print(x2)

Now when I print the weight and bias of the BatchNorm layer, respectively, it shows this:

Parameter containing:
tensor([0.8352, 0.2056], requires_grad=True)

Parameter containing:
tensor([0., 0.], requires_grad=True)

If BatchNorm is (weight * previous tensor + bias), then the first output value should have been (0.8352 * -0.0403) + 0 ≈ -0.0337, but it shows -1.6799.

Could someone please explain? I ask because one of my colleagues pointed this out. In our internal code, the output is indeed -0.033 for the first index, so we wanted to understand the reasoning behind PyTorch's value, or whether there are other factors involved.

I think I figured this out. Can someone confirm?

It basically normalizes the conv output per channel, so that we have C means and variances. It then adjusts the conv output by subtracting the mean (for that channel) and dividing by the standard deviation (for that channel), and then multiplies the result by the BatchNorm weight for that channel to get the value.

Yes, that’s the method applied in train() mode.
Additionally, the bias is also added to the result.
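
To make this concrete, here is a minimal sketch of that computation, using PyTorch's defaults (eps=1e-5, biased variance over the batch) and a random tensor standing in for your conv output:

import torch

x = torch.randn(1, 2, 3, 3)        # stand-in for the conv output (N, C, H, W)
bn = torch.nn.BatchNorm2d(2)       # weight initialized to 1, bias to 0
bn.train()

mean = x.mean(dim=(0, 2, 3), keepdim=True)                # per-channel mean
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # per-channel (biased) variance
manual = (x - mean) / torch.sqrt(var + bn.eps) \
         * bn.weight.view(1, -1, 1, 1) + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(bn(x), manual, atol=1e-6))           # should print True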

If you call model.eval(), the running estimates will be used to normalize the input instead of the current batch statistics.
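
The same check works in eval mode against the running buffers (again just a sketch with default settings, so the buffers still hold their initial values of 0 and 1):

import torch

x = torch.randn(1, 2, 3, 3)
bn = torch.nn.BatchNorm2d(2)
bn.eval()                                   # use running_mean / running_var

rm = bn.running_mean.view(1, -1, 1, 1)
rv = bn.running_var.view(1, -1, 1, 1)
manual = (x - rm) / torch.sqrt(rv + bn.eps) \
         * bn.weight.view(1, -1, 1, 1) + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(bn(x), manual, atol=1e-6))  # should print True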

Thanks! I am trying to ensure that BatchNorm stays in training mode but has its parameters frozen, because there are other layers that will be updated. Do I need to do this to the module after the net object is created?

net.bn.weight.requires_grad=False
net.bn.bias.requires_grad=False
net.bn.train()

If you don’t want to train the affine parameters at all (weight and bias), you could just initialize the batch norm layer with affine=False.
Otherwise, to disable their updates temporarily, you could set the .requires_grad attribute to False as shown in your example.
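
For reference, a small sketch of the affine=False option (2 channels just to match the example above):

import torch

bn = torch.nn.BatchNorm2d(2, affine=False)
print(bn.weight, bn.bias)   # both are None, so there are no affine parameters to update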

Ok, I will try that. But is net.bn.train() absolutely required so that the layer does not behave as if it were in eval mode? If I am not wrong, all modules are in train() mode by default, so maybe this is not needed. Conversely, if I needed to do inference, would I have to call net.bn.eval()?

Yes, that’s right. All modules are in training mode by default after initialization.
Sorry, I had overlooked the last line of code in your snippet; the net.bn.train() call is not needed there, since the module is already in training mode.

For inference, I would rather call net.eval(), which will set all modules recursively to evaluation mode.
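
As a quick sanity check with a stand-in model (your actual net will of course differ):

import torch.nn as nn

net = nn.Sequential(nn.Conv2d(1, 2, 3), nn.BatchNorm2d(2))  # stand-in model
print(net[1].training)   # True: modules start in training mode
net.eval()               # recursively sets every submodule to eval mode
print(net[1].training)   # False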
