Does the weight decay in optim.SGD also apply the penalty to the batch normalization parameters?