What is a good way to apply WeightDecay on bias terms?

Hi, guys,
I heard that WeightDecay might not be appropriate to apply to bias terms. Is that true? And what is a good way to apply WeightDecay to bias terms?

Your answer and guide will be appreciated!

Hi Songyuc!

I haven’t experimented with this in any detail, but I have used weight
decay on entire networks, including the bias terms, without any problems.

It is true – at least according to my intuition – that the bias terms have
less redundancy in them than do, for example, the weight terms of
fully-connected layers. So weight decay could be less beneficial for
bias terms (and maybe more likely to cause problems).

You could use the weight_decay feature of pytorch’s Optimizers together
with parameter groups, and then, for example, apply weight_decay to all
of the Parameters except the bias terms, apply it to only some of the
bias terms, apply a weaker weight_decay to the bias terms, and so on.
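Here is a minimal sketch of the first option (the model, the 1e-4 decay
strength, and the learning rate are just hypothetical placeholders):

import torch
import torch.nn as nn

# hypothetical toy model for illustration
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))

# split Parameters into bias terms and everything else
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

# apply weight_decay only to the non-bias group
optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.1,
)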

You can also implement weight decay by adding an explicit sum-of-squares
(L2) penalty to your loss function, e.g.,

loss_weight_decay_penalty = alpha * (fc1.weight**2).sum()

This would apply weight decay of “strength” alpha to the weight
Parameter of Linear layer fc1, but not to fc1’s bias Parameter.
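In a training step this might look like the following sketch (the layer,
loss function, data, and alpha value are all made up for illustration):

import torch
import torch.nn as nn

fc1 = nn.Linear(10, 2)
criterion = nn.MSELoss()
alpha = 1e-4  # hypothetical decay "strength"

inputs = torch.randn(8, 10)
targets = torch.randn(8, 2)

# ordinary loss plus the explicit L2 penalty on fc1.weight only;
# fc1.bias is left undecayed
loss = criterion(fc1(inputs), targets)
loss = loss + alpha * (fc1.weight ** 2).sum()
loss.backward()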

Best.

K. Frank


Thanks sincerely! I think the Per-parameter options feature is a good way to realize this.
By the way, in my opinion, only the weight parameters of the Conv layers in a CNN need WeightDecay.
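Parameter groups can express that too; here is a hypothetical sketch
(the small CNN and the decay strength are just placeholders):

import torch
import torch.nn as nn

# hypothetical small CNN, just for illustration
model = nn.Sequential(
    nn.Conv2d(3, 16, 3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 30 * 30, 10),
)

# collect the weight Parameters of the Conv layers
conv_weight_ids = {id(m.weight) for m in model.modules() if isinstance(m, nn.Conv2d)}
conv_weights = [p for p in model.parameters() if id(p) in conv_weight_ids]
other_params = [p for p in model.parameters() if id(p) not in conv_weight_ids]

# decay only the Conv weights; everything else gets no decay
optimizer = torch.optim.SGD(
    [
        {"params": conv_weights, "weight_decay": 1e-4},
        {"params": other_params, "weight_decay": 0.0},
    ],
    lr=0.1,
)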