Can we see the BatchNorm gamma and beta parameters?
Are they trained during backpropagation?
Thank you,
Is there a way to fix only the running_mean and running_var, while still updating gamma and beta (weight and bias) during the training stage? Thanks
Gamma and beta correspond to .weight and .bias, respectively.
Both are trained.
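For reference, here is a minimal sketch showing where gamma and beta live, and one way to address the question above about freezing only the running stats: calling .eval() on the layer stops the running-stat updates in the forward pass, while gamma/beta keep requires_grad=True and can still be trained.

```python
import torch
import torch.nn as nn

# gamma and beta live in .weight and .bias of the BatchNorm module
bn = nn.BatchNorm2d(3)

print(bn.weight)  # gamma, initialized to ones
print(bn.bias)    # beta, initialized to zeros

# both are nn.Parameters, so they receive gradients during backprop
assert isinstance(bn.weight, nn.Parameter)
assert bn.weight.requires_grad and bn.bias.requires_grad

# to freeze only the running stats while still training gamma/beta,
# put the layer in eval mode but keep requires_grad=True:
bn.eval()  # forward now uses (and no longer updates) running_mean/running_var
assert bn.weight.requires_grad  # gamma/beta are still trainable
```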
I came here regarding batchnorm and found what I needed, but I thought I might give an answer to that "unrelated question":
Typically, dropout is applied after the non-linear activation function (a). However, when using rectified linear units (ReLUs), it might make sense to apply dropout before the non-linear activation (b) for reasons of computational efficiency depending on the particular code implementation.
(a): Fully connected, linear activation -> ReLU -> Dropout -> …
(b): Fully connected, linear activation -> Dropout -> ReLU -> …
E.g., say we have the following activations in our hidden layer: [-1, -2, -3, 4, 5, 6]. The output of the ReLU function is then:
[0, 0, 0, 4, 5, 6]
Then, with dropout at a 50% drop probability (and inverted-dropout scaling by 2), we might get:
[0*2, 0, 0*2, 0, 5*2, 0] = [0, 0, 0, 0, 10, 0]
Now, if we instead pass the input through dropout first and then ReLU, we get the exact same result:
[-1, -2, -3, 4, 5, 6] -> [-1*2, 0, -3*2, 0, 5*2, 0]
[-2, 0, -6, 0, 10, 0] -> [0, 0, 0, 0, 10, 0]
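The same check in code. An explicit mask is used instead of nn.Dropout so that both orderings drop the identical positions; this is a sketch of the argument, not of how nn.Dropout samples its mask.

```python
import torch

# Dropout and ReLU commute: both are elementwise, and dropout's
# zeroing/scaling preserves sign, so max(0, .) can be applied before
# or after it with the same result.
x = torch.tensor([-1., -2., -3., 4., 5., 6.])
mask = torch.tensor([1., 0., 1., 0., 1., 0.])  # keep/drop pattern for p = 0.5
scale = 2.0  # inverted dropout scales kept units by 1 / (1 - p)

relu_then_drop = torch.relu(x) * mask * scale
drop_then_relu = torch.relu(x * mask * scale)

print(relu_then_drop)  # [0, 0, 0, 0, 10, 0] in both orderings
assert torch.equal(relu_then_drop, drop_then_relu)
```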
But in the PyTorch documentation, there is an example of "ConvNet as fixed feature extractor" where the features are obtained from the pretrained resnet model, and they only set requires_grad to False to freeze the whole network. Are you sure that requires_grad=False only freezes the parameters, and not the moving averages?
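A quick sanity check of exactly this concern: in train mode the running estimates are updated during the forward pass regardless of requires_grad, so freezing the parameters alone is not enough; the batchnorm layers would additionally need to be put into eval mode.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
for p in bn.parameters():
    p.requires_grad = False  # freezes gamma/beta only

bn.train()
before = bn.running_mean.clone()
_ = bn(torch.randn(8, 3, 4, 4) + 5.0)  # input with non-zero mean
assert not torch.equal(bn.running_mean, before)  # stats moved anyway

# to also freeze the running estimates, switch the layer to eval mode:
bn.eval()
frozen = bn.running_mean.clone()
_ = bn(torch.randn(8, 3, 4, 4) + 5.0)
assert torch.equal(bn.running_mean, frozen)
```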
I didn't understand the third point. Could you please explain it again?
Why don't you add a batchnorm layer after the conv1 layer?
What happens when we put batchnorm in eval mode? Does it use the mean and variance of the whole dataset at evaluation time, or do we have to calculate them manually? Is there an example of using batchnorm at evaluation time for only one sample? I don't have a batch at eval time, just one sample at a time.
During eval the running estimates will be used, which were updated during training using the batch mean and std. You can see these stats via bn.running_mean and bn.running_var.
Since these running estimates no longer depend on the batch size, you are fine using single samples during validation and testing.
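A small sketch of this behavior; the manual computation below just re-implements eval-mode batchnorm from the stored running estimates to show that a single sample needs no batch statistics.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

# "train" for a few batches so the running estimates move
bn.train()
for _ in range(10):
    _ = bn(torch.randn(32, 4) * 3.0 + 1.0)

# in eval mode a single sample works: the stored running stats are used
bn.eval()
single = torch.randn(1, 4)
out = bn(single)

# reproduce eval-mode batchnorm manually from the running estimates
expected = (single - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
expected = expected * bn.weight + bn.bias
assert torch.allclose(out, expected, atol=1e-6)
```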
Thank you. So just using with torch.no_grad(): and model.eval(), then calling the model with batchnorm, is OK? I remember when I used the trained model with these two commands, the test results were very bad. After that, I decided to remove the bn layers, train the model again, and never used bn again. But I was looking for a solution, so I'll have to check it again.
Thanks
Yes, it should be OK. However, if your running stats are skewed, e.g. due to a small batch size during training, or if the training and validation data come from different distributions, you might get bad results.
One way to counter this effect would be to change the momentum in the batchnorm layers, so that each update of the running estimates is smaller.
If your batch size is "small", you might get better results with e.g. nn.GroupNorm or nn.InstanceNorm.
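For illustration, a sketch of both suggestions; the specific momentum value and group count below are arbitrary choices, not recommendations.

```python
import torch
import torch.nn as nn

# a smaller momentum means each batch nudges the running estimates less
bn = nn.BatchNorm2d(16, momentum=0.01)  # default momentum is 0.1

# batch-size-independent alternatives (no running stats at all):
gn = nn.GroupNorm(num_groups=4, num_channels=16)  # normalizes over channel groups
inorm = nn.InstanceNorm2d(16)                     # normalizes each sample/channel

x = torch.randn(2, 16, 8, 8)  # works even with a tiny batch
for layer in (bn, gn, inorm):
    assert layer(x).shape == x.shape
```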
Thank you for great hints.
@ptrblck I was going through your implementation of batch norm (https://github.com/ptrblck/pytorch_misc/blob/master/batch_norm_manual.py), and you have used weights and biases in two places:
x = torch.randn(10, 3, 100, 100) * scale + bias
I assume scale and bias here are the weight and bias that get updated every epoch, so these are the learnable parameters.
input = input * self.weight[None, :, None, None] + self.bias[None, :, None, None]
Here you have multiplied the normalized activation by weight [1., 1., 1.] and added bias [0., 0., 0.]. For every epoch the weight/bias remain the same, so these are not being learned. Can you shed some light on the differences between these two weights/biases?
These tensors were just used to create an input tensor which does not have zero mean and unit variance, to show the effect of the batch norm layers. They won't be updated, but are random in each iteration.
These parameters are the affine parameters and will be updated.
In my example only the outputs without a parameter update are compared.
If you want to update them, you would have to create an optimizer, calculate the gradients via backward, and call optimizer.step().
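A minimal sketch of that training step around a single batchnorm layer; the squared-output loss is just a dummy to produce gradients for gamma and beta.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
optimizer = torch.optim.SGD(bn.parameters(), lr=0.1)

x = torch.randn(8, 3, 4, 4)
out = bn(x)
loss = out.pow(2).mean()  # dummy loss, only to drive the update

optimizer.zero_grad()
loss.backward()           # populates bn.weight.grad and bn.bias.grad
optimizer.step()          # gamma moves away from its init of all ones

assert bn.weight.grad is not None
assert not torch.allclose(bn.weight, torch.ones(3))
```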