I use this line "optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.momentum, weight_decay=args.weight_decay)" to apply L2 regularization and prevent overfitting. Generally, regularization penalizes only the weight parameters W of the model and not the bias parameters b, but I have read online that the weight decay set by the weight_decay argument of torch.optim optimizers is applied to all parameters in the network, i.e. it penalizes both the weights w and the biases b. Is that right?
Yes, weight_decay is applied to every parameter passed to the optimizer, biases included. If you wish to turn off weight decay for your network biases, you can use "parameter groups" to apply different optimizer hyperparameters to different sets of network parameters.
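Here is a minimal sketch of such a setup, assuming a single nn.Linear layer; the names lin and opt are just illustrative:

```python
import torch
import torch.nn as nn

# A single linear layer used for illustration.
lin = nn.Linear(10, 2)

# Two parameter groups: the weight gets weight_decay=0.5,
# the bias gets weight_decay=0.0. Both share the global lr=0.1.
opt = torch.optim.SGD(
    [
        {"params": [lin.weight], "weight_decay": 0.5},
        {"params": [lin.bias], "weight_decay": 0.0},
    ],
    lr=0.1,
)
```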
Here opt will use a learning rate of 0.1 for all of lin's parameters, both weight and bias, but will apply a weight decay of 0.5 only to the weight and no weight decay (weight_decay=0.0) to the bias.