Is there an example of how to use batch norm?

Can we see the BatchNorm gamma & beta parameters?
Are they trained during backpropagation?

Thank you,

Is there some way to fix only the running_mean and running_var, while still updating gamma and beta (weight and bias) during the training stage? Thanks

Gamma and beta should correspond to .weight and .bias, respectively.
Both are trained.
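
For example, here is a minimal sketch showing where these live on an nn.BatchNorm2d layer (the number of channels is arbitrary):

```python
import torch.nn as nn

bn = nn.BatchNorm2d(3)

# gamma and beta are registered as learnable parameters
print(bn.weight)                 # gamma, initialized to ones
print(bn.bias)                   # beta, initialized to zeros
print(bn.weight.requires_grad)   # True, so it is updated during backprop

# the running statistics are buffers, not trainable parameters
print(bn.running_mean)           # zeros at initialization
print(bn.running_var)            # ones at initialization
```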

I came here regarding batchnorm and found what I needed, but hey, I thought I might give an answer to that “unrelated question”:

Typically, dropout is applied after the non-linear activation function (a). However, when using rectified linear units (ReLUs), it might make sense to apply dropout before the non-linear activation (b) for reasons of computational efficiency depending on the particular code implementation.

(a): Fully connected, linear activation -> ReLU -> Dropout -> …
(b): Fully connected, linear activation -> Dropout -> ReLU -> …

E.g., say we have the following activations in our hidden layer: [-1, -2, -3, 4, 5, 6]. The output of the ReLU function is then:

[0, 0, 0, 4, 5, 6]

Then, with dropout at a 50% drop probability, we might get the following:

[0*2, 0, 0*2, 0, 5*2, 0] = [0, 0, 0, 0, 10, 0]

Now, if we instead pass the input through dropout first and then through the ReLU (using the same dropout mask), we get the exact same result:

[-1, -2, -3, 4, 5, 6] -> [-1*2, 0, -3*2, 0, 5*2, 0]

[-2, 0, -6, 0, 10, 0] -> [0, 0, 0, 0, 10, 0]
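
Here is a small sketch of that equivalence, assuming a fixed dropout mask so that both orderings drop the same units (the mask and probability are just illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1., -2., -3., 4., 5., 6.])
p = 0.5

# fixed keep-and-scale mask: dropped units -> 0, kept units -> 1 / (1 - p)
mask = torch.tensor([1., 0., 1., 0., 1., 0.]) / (1 - p)

relu_then_dropout = F.relu(x) * mask   # order (a)
dropout_then_relu = F.relu(x * mask)   # order (b)

print(relu_then_dropout)  # tensor([ 0.,  0.,  0.,  0., 10.,  0.])
print(torch.equal(relu_then_dropout, dropout_then_relu))  # True
```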


But in the PyTorch documentation, there is an example of “ConvNet as fixed feature extractor” where the features are obtained from a pretrained ResNet model, and they only set requires_grad to False to freeze the whole network. Are you sure that requires_grad=False only freezes the parameters, and not the moving averages?

I didn’t understand the 3rd point. Could you please explain it again?

Why don’t you add a batch-norm layer after the conv1 layer?

What happens when we put batchnorm in eval mode? Does it use the mean and variance of the whole dataset at evaluation time, or do we have to calculate them manually? Is there an example of using batchnorm at evaluation time with only one sample? I don’t have a batch at eval time, just one sample at a time.

During eval the running estimates will be used, which were updated during training using the batch mean and std.
You can see these stats using bn.running_mean and bn.running_var.
Since these running estimates do not depend on the batch size anymore, you are fine using single samples during validation and testing.
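
As a minimal sketch of this behavior (the layer size, input statistics, and number of iterations are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

# training mode: the running estimates are updated from each batch's statistics
bn.train()
for _ in range(200):
    bn(torch.randn(32, 4) * 2.0 + 3.0)

print(bn.running_mean)  # roughly 3 for every feature
print(bn.running_var)   # roughly 4 for every feature

# eval mode: the running estimates are used, so a single sample is fine
bn.eval()
with torch.no_grad():
    out = bn(torch.randn(1, 4) * 2.0 + 3.0)
print(out.shape)  # torch.Size([1, 4])
```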


Thank you. So just using with torch.no_grad(): and model.eval(), and then calling the model with batchnorm, is OK? I remember that when I used the trained model with these two commands, the test results were very bad. After that, I decided to remove the bn layer, train the model again, and never use bn again. But I was looking for a solution, so I have to check it again.

Thanks

Yes, it should be OK. However, if your running stats are skewed, e.g. due to a small batch size during training, or if the training and validation data come from different distributions, you might get bad results.
One way to counter this effect would be to change the momentum in the batchnorm layers, so that each update of the running estimates is smaller.
If your batch size is “small”, you might get better results with e.g. nn.GroupNorm or nn.InstanceNorm.
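
For example, a sketch of both options (the channel count, group count, and momentum value are placeholders):

```python
import torch.nn as nn

# smaller momentum -> each batch changes the running estimates less
bn = nn.BatchNorm2d(64, momentum=0.01)  # PyTorch's default momentum is 0.1

# batch-size-independent alternatives for small batch sizes
gn = nn.GroupNorm(num_groups=8, num_channels=64)
inorm = nn.InstanceNorm2d(64, affine=True)
```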


Thank you for the great hints.

@ptrblck I was going through your implementation of batch norm at https://github.com/ptrblck/pytorch_misc/blob/master/batch_norm_manual.py, and you have used weight and bias in two places:

  1. x = torch.randn(10, 3, 100, 100) * scale + bias. I assume that scale and bias here are the weight and bias which get updated every epoch, so these are the learnable parameters.
  2. input = input * self.weight[None, :, None, None] + self.bias[None, :, None, None]. Here you have multiplied the normalized activation with weight [1., 1., 1.] and bias [0., 0., 0.]. The weight/bias remain the same for every epoch, so these are not being learned.

Can you shed some light on the differences between these two weights/biases?

  1. These tensors were just used to create an input tensor that does not have zero mean and unit variance, in order to show the effect of the batch norm layers. They won’t be updated, but are random in each iteration.

  2. These parameters are the affine parameters and will be updated.
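
A small sketch of that distinction, with scale and bias chosen arbitrarily here:

```python
import torch
import torch.nn as nn

# (1) plain tensors, only used to create a non-normalized input; never trained
scale = torch.rand(3)[None, :, None, None] * 5.0
bias = torch.rand(3)[None, :, None, None] * 10.0
x = torch.randn(10, 3, 100, 100) * scale + bias

# (2) the layer's affine parameters are nn.Parameters, i.e. learnable
bn = nn.BatchNorm2d(3)
print(isinstance(scale, nn.Parameter))      # False
print(isinstance(bn.weight, nn.Parameter))  # True
```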

  1. About the second point: as you mentioned, the weights/biases will be updated, but in your implementation their values remained the same ([1, 1, 1] and [0, 0, 0]) for all 10 epochs, so I got confused.

In my example, only the outputs are compared, without any parameter updates.
If you want to update them, you would have to create an optimizer, calculate the gradients via backward, and call optimizer.step().
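
A minimal sketch of such an update (the dummy loss, optimizer choice, and learning rate are just placeholders):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
optimizer = torch.optim.SGD(bn.parameters(), lr=0.1)

x = torch.randn(10, 3, 100, 100) * 5.0 + 2.0
out = bn(x)
loss = ((out - 1.0) ** 2).mean()  # dummy loss just to create gradients
loss.backward()
optimizer.step()                  # the affine parameters are updated here

print(bn.weight)  # no longer all ones
print(bn.bias)    # no longer all zeros
```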