The mean and variance are not hardcoded. Both will be initialized and updated during training, i.e. the current batch will be normalized using its own mean and variance (so that mean=0 and var=1), while the `running_mean` and `running_var` will be updated using the momentum term. During evaluation (after calling `model.eval()`), the running estimates will be used instead.
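A small sketch of this behavior (the numbers are a made-up toy batch, and the default `momentum=0.1` is assumed):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=3)  # default momentum=0.1
# running estimates start at 0 and 1
print(bn.running_mean, bn.running_var)

x = torch.randn(8, 3) * 2.0 + 5.0  # batch with mean ~5, std ~2
out = bn(x)  # training mode: normalized with the batch stats
print(out.mean(dim=0))  # ~0 per feature
print(out.var(dim=0, unbiased=False))  # ~1 per feature

# running_mean was updated as (1 - momentum) * running_mean + momentum * batch_mean
print(bn.running_mean)

bn.eval()
out_eval = bn(x)  # eval mode: uses running_mean / running_var, no update
```

Note that the running estimates are only updated in training mode; the `eval()` forward pass leaves them untouched.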
No, it’s the number of features, e.g. the number of channels of the output activation of a conv layer. The running estimates will thus have the same dimension, i.e. for 100 channels the batch norm layer will have 100 `running_mean` and `running_var` values.
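To illustrate (100 channels is just the example number from above):

```python
import torch
import torch.nn as nn

# num_features refers to the channel dimension, not the spatial size
bn = nn.BatchNorm2d(num_features=100)
print(bn.running_mean.shape)  # one running mean per channel
print(bn.running_var.shape)   # one running var per channel
print(bn.weight.shape)        # affine gamma, also one per channel

x = torch.randn(4, 100, 7, 7)  # [batch, channels, height, width]
out = bn(x)                    # stats are computed over batch + spatial dims
```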
It depends on your model, data etc., like so many things.
As I tried to explain, the affine parameters might “undo” the standardization, if it would be the ideal activation output.
From the BatchNorm paper:
Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation x^(k), a pair of parameters γ^(k), β^(k), which scale and shift the normalized value: y^(k) = γ^(k) x̂^(k) + β^(k). These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting γ^(k) = √Var[x^(k)] and β^(k) = E[x^(k)], we could recover the original activations, if that were the optimal thing to do.
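You can check this identity-recovery property directly. In this sketch I manually set the affine parameters to the batch statistics (normally they would be learned, as the paper says):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1024, 10) * 3.0 + 7.0  # toy activations, mean ~7, std ~3

bn = nn.BatchNorm1d(10)
with torch.no_grad():
    # gamma = sqrt(Var[x]), beta = E[x], as in the paper
    bn.weight.copy_(x.std(dim=0, unbiased=False))
    bn.bias.copy_(x.mean(dim=0))

out = bn(x)  # y = gamma * x_hat + beta, which undoes the standardization
# out matches x up to the small eps added inside the variance for stability
print(torch.allclose(out, x, atol=1e-3))
```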