I noticed that named_parameters return bn parameters too. What’s the effect if I pass bn parameters to optim?
The BatchNorm parameters returned by named_parameters are the weight and bias, which correspond to the gamma and beta from the BatchNorm paper.
These are the learnable affine parameters of the layer. Passing them to the optimizer just means they will be trained, and since they scale and shift the normalized activations, they can partially undo the normalization performed with the running stats.
That’s expected behavior.
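A minimal sketch of what this looks like: named_parameters includes the BatchNorm weight and bias, and if you want to treat them differently in the optimizer (a common case is excluding them from weight decay), you can filter by module type. The model and hyperparameters here are just placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))

# named_parameters returns the BatchNorm affine parameters too:
# '1.weight' (gamma) and '1.bias' (beta)
names = [name for name, _ in model.named_parameters()]
print(names)  # ['0.weight', '0.bias', '1.weight', '1.bias']

# To handle them separately (e.g. no weight decay on BatchNorm),
# split the parameters by module type:
bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
decay, no_decay = [], []
for module in model.modules():
    params = module.parameters(recurse=False)
    if isinstance(module, bn_types):
        no_decay.extend(params)
    else:
        decay.extend(params)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1,
)
```

If you simply pass model.parameters() to the optimizer, the BatchNorm weight and bias are trained along with everything else, which is usually what you want.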