What is the standard scale of BatchNorm1d?

I watched a video explanation of batch norm, and the focal point of that video is the idea that we “put our data on a known or standard scale”. This is often the scale [0, 1], as in the next image:

On the other hand, it is unclear to me what the standard scale for BatchNorm1d is. Is it the gamma parameter (γ)?

BatchNorm will standardize the data by default using the mean and variance (or running mean and running variance during eval). The affine parameters gamma and beta (weight and bias in nn.BatchNorm2d, respectively) allow this layer to undo the standardization or to manipulate it in any way beneficial for the model and training.
E.g. if standardizing the activation might not be beneficial for the training, gamma might be pushed towards the value of the standard deviation, and beta might be learned to be close to the mean value of the data, which will eventually create an identity layer.
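
The identity case described above can be checked with a small sketch. Here the affine parameters are set by hand rather than learned, and the input tensor `x` is a made-up example, so this only illustrates that such a setting exists, not that training will find it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical activations with a non-zero mean and a non-unit spread
x = 5.0 + 3.0 * torch.randn(64, 10)

bn = nn.BatchNorm1d(10)  # affine=True by default: weight is gamma, bias is beta
bn.train()

# Batch statistics as used for normalization in training mode
# (biased variance, i.e. divided by N rather than N - 1)
batch_mean = x.mean(dim=0)
batch_var = x.var(dim=0, unbiased=False)

# Set gamma to the batch standard deviation and beta to the batch mean by hand
with torch.no_grad():
    bn.weight.copy_(torch.sqrt(batch_var + bn.eps))
    bn.bias.copy_(batch_mean)

y = bn(x)
identity_error = (y - x).abs().max().item()
print(identity_error)  # only floating point error remains
```

With these values the layer returns its input unchanged, up to floating point error.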

Let me know, if this makes things clearer or if I’ve missed some points.


You are close, but let me explain what I am particularly unclear about.
Batch normalization applies to a layer that we can represent as a tensor of activations A. The values of A lie somewhere in the interval [r1, r2], which means that all the activations are in that interval.

After the batch norm, which is just a transformation of the activations A, we get the activations tensor B. What would be the range of B?

If you are using the affine parameters, you can’t limit the value range of the output activations, since e.g. a high weight value might scale the values to an arbitrarily large range.

However, if you set affine=False, each batch will be standardized during training using your formula, so each feature ends up with zero mean and unit variance. If the activations are roughly Gaussian, the result is approximately a standard normal distribution, and the values would then follow the 68-95-99.7 rule. So you will not limit the range to particular values. However, take these explanations with a grain of salt, as the distribution of your input activations might be completely different.
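
A rough sketch of how the output range behaves with affine=False. The helper `max_abs_output` and the batch sizes are made up for illustration. One general fact is assumed here: when N values are standardized with the biased standard deviation, no value can exceed sqrt(N - 1) in absolute value, so very small batches look artificially "bounded", while large batches are not confined to any fixed interval:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def max_abs_output(batch_size, num_features=10):
    """Largest absolute value produced by BatchNorm1d without gamma/beta."""
    bn = nn.BatchNorm1d(num_features, affine=False)
    bn.train()  # normalize with the statistics of the current batch
    x = 1000 * torch.randn(batch_size, num_features)
    return bn(x).abs().max().item()

small = max_abs_output(3)
large = max_abs_output(10000)

print(small, math.sqrt(3 - 1))  # small batch: cannot exceed sqrt(N - 1)
print(large)                    # large batch: values well outside (-1, 1)
```

So the apparent range comes from the batch size and the input distribution, not from a fixed interval built into the layer.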

I am assuming with affine=True, the batch norm should be capable of learning all 4 parameters exactly. The mean, the standard deviation, the gamma, and the beta.

With affine=False this is not the case. The example states “without learnable parameters”, so we have some predefined (hardcoded) batch normalization.

I am not sure, but the only parameter from the example m = nn.BatchNorm1d(100) is 100. Is this the batch size?

This is a modified example from before:

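import torch
import torch.nn as nn

# With learnable parameters (overwritten below, kept only for comparison)
m = nn.BatchNorm1d(10)
# Without learnable parameters
m = nn.BatchNorm1d(10, affine=False)
input = 1000 * torch.randn(3, 10)
output = m(input)
print(input)   # raw values, roughly in the thousands
print(output)  # standardized values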

And it provides the output:

tensor([[  553.9385,   358.0948,  -311.0905,   946.6582,   320.4365,  1158.4661,
           610.9046,  -300.3665,  -730.4023,  -432.1760],
        [ -428.9000,  -373.9978,   304.2072,  -230.9816,   246.2242,   757.4435,
          -489.5371, -2545.6099,  2042.6073,  -763.0421],
        [ -350.7178,  1166.3325,  -511.8971, -1168.5955,  1719.1627,   -95.6929,
          1275.7457,  2684.2368, -1186.4659, -1935.6162]])
tensor([[ 1.4106, -0.0403, -0.3979,  1.2684, -0.6516,  1.0550,  0.1995, -0.1150,
         -0.5413,  0.9479],
        [-0.7929, -1.2041,  1.3742, -0.0925, -0.7612,  0.2882, -1.3122, -1.1632,
          1.4021,  0.4350],
        [-0.6177,  1.2444, -0.9763, -1.1759,  1.4128, -1.3431,  1.1128,  1.2782,
         -0.8609, -1.3829]])

It looks like it tries to squeeze the outputs into the (-1, 1) range, with as few elements as possible outside that range.

Any comments?

I found the idea that all 4 parameters can be learnable in this video: https://youtu.be/dXB-KQYkzNU?t=270

@ptrblck as I analyzed further, even without learnable parameters the output of a batch norm will have a standard deviation of 1 and a mean of 0, which raises the question: why do we need learnable parameters in the first place?

import torch
import torch.nn as nn

# Without Learnable Parameters
m = nn.BatchNorm1d(10, affine=False)
input = 1000 * torch.randn(3, 10)
output = m(input)
print(output.mean()) # will be close to 0
print(output.std()) # will be close to 1

The mean and variance are not hardcoded. Both will be initialized and updated during training, i.e. the current batch will be normalized using its mean and variance, so that mean=0 and var=1, while the running_mean and running_var will be updated using the momentum term. During evaluation (calling model.eval()), the running estimates will be used.
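
A minimal sketch of the difference between the two modes, assuming the default momentum of 0.1 and a made-up input batch `x`. After a single training step the running estimates have moved only a small way from their initial values, so the eval output differs clearly from the training output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(4, affine=False)  # default momentum=0.1
x = 10.0 + 2.0 * torch.randn(32, 4)   # hypothetical batch

# Initial running estimates: mean 0 and variance 1
init_mean = bn.running_mean.clone()

bn.train()
out_train = bn(x)  # normalized with the statistics of this batch

# The running mean moves a fraction `momentum` towards the batch mean
expected_mean = (1 - bn.momentum) * init_mean + bn.momentum * x.mean(dim=0)
mean_update_ok = torch.allclose(bn.running_mean, expected_mean, atol=1e-5)

bn.eval()
out_eval = bn(x)  # normalized with running_mean / running_var instead

print(out_train.mean().item())  # close to 0
print(out_eval.mean().item())   # far from 0 after only one update
print(mean_update_ok)
```

After many training batches the running estimates approach the statistics of the data, and the two modes give similar outputs.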

No, it’s the number of features, e.g. the number of channels of the output activation of a conv layer. The running estimates will thus have the same dimension, i.e. for 100 channels the batch norm layer will have 100 running_mean and running_var values.
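
As a quick check of the shapes, here is a small sketch. The batch sizes 3 and 64 are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(100)  # 100 is the number of features, not the batch size

print(bn.running_mean.shape)  # one running mean per feature
print(bn.running_var.shape)   # one running variance per feature
print(bn.weight.shape)        # one gamma per feature
print(bn.bias.shape)          # one beta per feature

# The batch dimension is free: both calls work with the same layer
out_small = bn(torch.randn(3, 100))
out_large = bn(torch.randn(64, 100))

# A different feature dimension is rejected
try:
    bn(torch.randn(3, 10))
    feature_mismatch_raises = False
except RuntimeError:
    feature_mismatch_raises = True
print(feature_mismatch_raises)
```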

It depends on your model, data etc., like so many things.
As I tried to explain, the affine parameters might “undo” the standardization, if the unnormalized activations turn out to be the ideal output.
From the BatchNorm paper:

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}$, $\beta^{(k)}$, which scale and shift the normalized value: $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$. These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathrm{E}[x^{(k)}]$, we could recover the original activations, if that were the optimal thing to do.


OK, I got more than enough information. I have some intriguing questions, but I will ask them in another thread.