Weight decay on parametrized weight

grudloff · February 17, 2022, 2:05pm

If a weight is parametrized and the optimizer uses weight decay, is the decay applied to the original weight or to the parametrized weight?

grudloff · February 17, 2022, 10:02pm

I found the answer myself. I will share my findings in case someone else runs into the same question.

The optimizer’s weight_decay argument scales the parameters and folds them into the update in a way that depends on the selected optimizer. Usually the scaled parameters are added to the gradient, $g \leftarrow g + \lambda\theta$. In the case of AdamW, a term proportional to the parameter is instead applied directly in the parameter update step (see AdamW Explained).
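To make the difference between the two rules concrete, here is a scalar sketch of the first Adam step from a fresh state, written in plain Python. The function name adam_first_step and the default values are my own choices for illustration and are not the library's code. With decoupled=False the decay is added to the gradient, and with decoupled=True the parameter is shrunk directly, as AdamW does.

```python
import math


def adam_first_step(theta, grad, lr=0.1, wd=0.5, eps=1e-8, decoupled=False):
    """One Adam step at t=1 for a scalar parameter (illustrative sketch)."""
    beta1, beta2 = 0.9, 0.999
    if decoupled:
        # AdamW style: shrink the parameter directly, leave the gradient alone.
        theta = theta * (1 - lr * wd)
    else:
        # L2 style: g <- g + lambda * theta, so the decay is also rescaled by Adam.
        grad = grad + wd * theta
    # Moment estimates after one step, starting from zero.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    # Bias correction for t=1.
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


coupled = adam_first_step(2.0, 1.0, decoupled=False)
decoupled = adam_first_step(2.0, 1.0, decoupled=True)
print(coupled, decoupled)
```

In the coupled case the decay is absorbed into the normalized gradient, so the parameter moves by about lr regardless of wd (2.0 goes to roughly 1.9). In the decoupled case the shrink by lr * wd comes on top of the Adam step (2.0 goes to roughly 1.8).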

Since the actual parameter is the “original” parameter and not the parametrized one (as can be seen from model.named_parameters()), the weight decay is applied to the original parameter and not the parametrized one.
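A quick way to check this, assuming the weight was parametrized with torch.nn.utils.parametrize.register_parametrization, is to list the parameter names. The Symmetric module below is only a stand-in parametrization for illustration; any other parametrization should behave the same way.

```python
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize


class Symmetric(nn.Module):
    # Illustrative parametrization: build a symmetric matrix from the upper triangle.
    def forward(self, X):
        return X.triu() + X.triu(1).transpose(-1, -2)


layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Symmetric())

# Only the unconstrained tensor is registered as a parameter.
names = [name for name, _ in layer.named_parameters()]
print(names)

# layer.weight is recomputed from the original on access; it is not a Parameter.
weight_is_param = isinstance(layer.weight, nn.Parameter)
print(weight_is_param)
```

The list contains parametrizations.weight.original and no plain weight entry, so an optimizer built from layer.parameters() never sees the parametrized tensor.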

Here is a snippet from sgd's implementation:

        if weight_decay != 0:
            d_p = d_p.add(param, alpha=weight_decay)

As can be seen, the decay term is computed from param, which is the tensor given to the optimizer’s constructor on creation, i.e. the original parameter.
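The same thing can be checked numerically. The sketch below uses a made-up parametrization, Double, whose effective weight is 2 * original, and sets the gradient to zero by hand so that only the decay term moves the parameter. The values of lr and wd are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize


class Double(nn.Module):
    # Hypothetical parametrization: effective weight = 2 * original.
    def forward(self, X):
        return 2.0 * X


layer = nn.Linear(2, 2, bias=False)
parametrize.register_parametrization(layer, "weight", Double())

original = layer.parametrizations.weight.original
before = original.detach().clone()

lr, wd = 0.1, 0.5
opt = torch.optim.SGD(layer.parameters(), lr=lr, weight_decay=wd)

# Zero gradient, so the step consists of the weight decay term alone.
original.grad = torch.zeros_like(original)
opt.step()

after = original.detach()
# Decay on the original: theta <- theta * (1 - lr * wd).
decayed_original = torch.allclose(after, before * (1 - lr * wd))
# An L2 penalty (wd / 2) * ||2 * theta||^2 on the parametrized weight would
# instead give theta <- theta * (1 - 4 * lr * wd).
decayed_parametrized = torch.allclose(after, before * (1 - 4 * lr * wd))
print(decayed_original, decayed_parametrized)
```

The step matches the first formula and not the second, which is consistent with the decay acting on the original parameter. If the penalty should act on the parametrized weight, one option is to add it to the loss explicitly instead of relying on weight_decay.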