I was wondering whether the parameters of batch_norm layers are included when computing the L2 norm for weight decay in PyTorch's implementation?
The weight_decay argument is applied to every parameter in the current parameter group. I.e. if you are passing the batchnorm parameters to this group (or are just using a single group and passing all parameters), weight decay will also be applied to them.
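If you would rather not decay the batchnorm parameters, one option is to put them in their own parameter group with weight_decay set to 0. Here is a minimal sketch of that idea; the toy model, the learning rate, the decay value of 1e-4, and the names norm_params / other_params are my own assumptions for illustration, not something from your setup:

```python
import torch
import torch.nn as nn

# Toy model, assumed purely for illustration.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

# Split parameters by the type of the module that owns them.
norm_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
norm_params, other_params = [], []
for module in model.modules():
    # recurse=False returns only the parameters owned directly by this module,
    # so each parameter is collected exactly once.
    params = list(module.parameters(recurse=False))
    if isinstance(module, norm_types):
        norm_params.extend(params)
    else:
        other_params.extend(params)

# Per-group options override the optimizer-level defaults.
optimizer = torch.optim.SGD(
    [
        {"params": other_params, "weight_decay": 1e-4},  # decayed
        {"params": norm_params, "weight_decay": 0.0},    # not decayed
    ],
    lr=0.1,
)
```

You can check which setting each group ended up with by looking at optimizer.param_groups. If you use other normalization layers (e.g. LayerNorm) you would have to add them to norm_types yourself.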