How does SGD weight_decay work?

The part that I circled doesn’t seem right to me:

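In L2 regularization, you modify the cost as follows, where $J(w)$ is the original cost and $\lambda$ is the regularization strength:

$$J_{\text{reg}}(w) = J(w) + \lambda \lVert w \rVert^2$$

The weight update, with learning rate $\eta$, should then be

$$w \leftarrow w - \eta \left( \nabla J(w) + 2 \lambda w \right)$$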

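The way PyTorch applies the weight decay seems correct to me: it adds a term proportional to the weights to the gradient. You can drop the factor of 2 that comes from differentiating the squared norm, since it can be absorbed into the weight_decay coefficient.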
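To make the point about the factor 2 concrete, here is a minimal sketch in plain Python (no PyTorch needed). It assumes a toy quadratic data loss and uses hypothetical helper names (`data_grad`, `step_l2_in_cost`, `step_weight_decay`) that are not part of any library. It compares one plain SGD step on a cost with an explicit L2 penalty against one step where a decay term is added straight to the gradient, which is how I understand `torch.optim.SGD` to handle `weight_decay` when momentum is off.

```python
# Hypothetical toy data loss: J(w) = 0.5 * sum((w_i - target_i)^2),
# so its gradient is simply (w - target).
def data_grad(w, target):
    return [wi - ti for wi, ti in zip(w, target)]


def step_l2_in_cost(w, target, lr, lam):
    # Cost is J(w) + lam * ||w||^2, so the gradient is grad J(w) + 2 * lam * w.
    g = data_grad(w, target)
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, g)]


def step_weight_decay(w, target, lr, weight_decay):
    # Decay term added directly to the gradient: grad J(w) + weight_decay * w.
    g = data_grad(w, target)
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, g)]


w = [1.0, -2.0, 0.5]
target = [0.0, 1.0, 2.0]
lr, lam = 0.1, 0.01

a = step_l2_in_cost(w, target, lr, lam)
b = step_weight_decay(w, target, lr, weight_decay=2.0 * lam)

# The two updates agree up to floating-point rounding when weight_decay = 2 * lam.
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))
```

So the factor 2 only changes how the coefficient is named: choosing `weight_decay = 2 * lam` reproduces the L2-penalized update for plain SGD.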
