Query about weight decay

DLopezG · July 31, 2020, 4:54pm

I understand this is the formula for L2 regularization:

Does weight decay equal lambda in this equation?

Kushaj · July 31, 2020, 5:33pm

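Here lambda is the L2 regularization factor. With plain SGD, this value is proportional to the weight decay coefficient (the two differ only by a factor of the learning rate), but for adaptive optimizers like Adam this is not the case.

In short, weight decay is a term proportional to the weight itself that you subtract directly in the weight update equation, instead of adding a penalty to the loss.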
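
To make the SGD case concrete, here is a short derivation. It is only a sketch: I am assuming the penalty has the common form $\frac{\lambda}{2}\lVert w\rVert^2$ (the formula in the question may use a different constant), and $\eta$ (learning rate), $L_0$ (unregularized loss) and $\lambda'$ (weight decay coefficient) are my own symbols.

```latex
% Assumed form of the L2-regularized loss
L(w) = L_0(w) + \frac{\lambda}{2}\,\lVert w \rVert^2

% One SGD step on L: the penalty contributes \lambda w to the gradient
w \leftarrow w - \eta\,\nabla L(w)
  = w - \eta\,\nabla L_0(w) - \eta\lambda\, w

% Weight decay: subtract \lambda' w directly in the update
w \leftarrow w - \eta\,\nabla L_0(w) - \lambda'\, w

% The two updates are identical when
\lambda' = \eta\,\lambda
```

With Adam the gradient is rescaled per parameter before the update, so the $\lambda w$ term inside the gradient gets rescaled too and this simple identity no longer holds.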

DLopezG · July 31, 2020, 6:36pm

Then how do I add L2 regularization when using Adam?

Kushaj · July 31, 2020, 8:16pm

You don’t. L2 regularization does not work well with adaptive optimizers like Adam; weight decay is the way to go.

But if you still want L2 regularization, just use optim.Adam and pass the weight_decay argument (PyTorch uses that argument as an L2 penalty by adding weight_decay times the weight to the gradient; AdamW fixes this by applying the decay to the weights directly). I may be wrong on this, as I have not followed the complete AdamW discussion on PyTorch.
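
A small experiment that shows the difference, as far as I understand it. This is a minimal sketch, not a training recipe: `one_step` is a hypothetical helper I made up, and the values `lr=0.01` and `weight_decay=0.1` are arbitrary. The loss is built so its gradient is exactly zero, so any change in the weight comes from `weight_decay` alone.

```python
import torch


def one_step(optimizer_cls, weight_decay=0.1, lr=0.01):
    """Take one optimizer step on a single weight and return its new value.

    Hypothetical helper for illustration only.
    """
    # A single scalar weight starting at 1.0
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=lr, weight_decay=weight_decay)

    # Loss whose gradient w.r.t. w is exactly zero
    loss = (w * 0.0).sum()
    loss.backward()
    opt.step()
    return w.item()


# optim.Adam: weight_decay * w is added to the gradient (L2 penalty),
# so it passes through Adam's per-parameter rescaling
adam_w = one_step(torch.optim.Adam)

# optim.AdamW: the weight is shrunk directly, outside the adaptive update
adamw_w = one_step(torch.optim.AdamW)

print(f"Adam : {adam_w:.6f}")
print(f"AdamW: {adamw_w:.6f}")
```

On this first step Adam moves the weight by roughly `lr` no matter how large `weight_decay` is, because the adaptive normalization cancels the size of the penalty term. AdamW shrinks the weight by roughly `lr * weight_decay`, which is what you would expect from weight decay.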