If a weight is parametrized and one uses weight decay with the optimizer, is the weight decay applied to the original weight or to the parametrized weight?
I found the answer myself. I will share my findings in case someone else stumbles upon the same question.
The `weight_decay` parameter works by scaling the parameters and folding the result into the update of the selected optimizer. Usually the decay term is added to the gradients as
$g \mathrel{+}= \lambda\theta$; in the case of AdamW, a proportional term is instead added directly to the parameter update step (see AdamW Explained).
Since the actual parameter held by the optimizer is the “original” parameter and not the parametrized one (as can be seen from
`model.named_parameters()`), the weight decay is applied to the original parameter and not to the parametrized one.
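A quick way to see this is to register a parametrization and list the module's parameters. This is a minimal sketch; the `Symmetric` parametrization is the standard toy example, not anything from the original question:

```python
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        # Expose a symmetric view of an unconstrained square matrix
        return X.triu() + X.triu(1).transpose(-1, -2)

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Symmetric())

# The trainable tensor is the unconstrained "original"; "weight"
# itself no longer appears as a parameter, only as a computed view.
for name, _ in layer.named_parameters():
    print(name)
```

This prints `bias` and `parametrizations.weight.original`, so any optimizer built from `layer.parameters()` only ever sees the original tensor.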
Here is a snippet from PyTorch's SGD implementation:

```python
if weight_decay != 0:
    d_p = d_p.add(param, alpha=weight_decay)
```
As can be seen, the update uses
`param`, the tensor given to the optimizer's constructor on creation.
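To convince yourself numerically, you can check one SGD step by hand. This is an illustrative sketch; the `Scale` parametrization and the hyperparameters are made up for the example:

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Scale(nn.Module):
    def forward(self, X):
        return 2.0 * X  # toy parametrization

torch.manual_seed(0)
layer = nn.Linear(2, 2, bias=False)
parametrize.register_parametrization(layer, "weight", Scale())

opt = torch.optim.SGD(layer.parameters(), lr=0.1, weight_decay=0.5)
original = layer.parametrizations.weight.original
before = original.detach().clone()

loss = layer(torch.randn(4, 2)).sum()
loss.backward()
grad = original.grad.detach().clone()
opt.step()

# SGD with decay: theta <- theta - lr * (g + weight_decay * theta),
# where theta is the *original* tensor, not the parametrized view.
expected = before - 0.1 * (grad + 0.5 * before)
assert torch.allclose(original.detach(), expected)
```

The assertion holds because the decay term `weight_decay * theta` is computed from the original tensor, exactly as in the `d_p.add(param, alpha=weight_decay)` line above.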