Hi, guys,
I heard that weight decay might not be appropriate to apply to all the bias terms. Is that true? And what is a good way to apply weight decay to bias terms?
Your answer and guidance will be appreciated!
Hi Songyuc!
I haven’t experimented with this in any detail, but I have used weight
decay on entire networks, including the bias terms, without any problems.
It is true – at least according to my intuition – that the bias terms have
less redundancy in them than do, for example, the weight terms of
fully-connected layers. So it could be that weight decay could be less
beneficial for bias terms (and maybe more likely to cause problems).
You could use the weight_decay feature of pytorch's Optimizers together with
parameter groups: apply weight_decay to all of the Parameters except the bias
terms, or only apply weight_decay to some bias terms, or apply a weaker
weight_decay to some bias terms, and so on.
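A minimal sketch of the parameter-group approach (the two-layer model, learning rate, and decay strength here are made up for illustration): bias Parameters are routed into a group with weight_decay set to zero, everything else keeps the decay.

```python
import torch
import torch.nn as nn

# hypothetical model, just to have some weights and biases to split up
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))

decay, no_decay = [], []
for name, param in model.named_parameters():
    # send bias terms to the no-decay group, everything else to the decay group
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},  # weights get decayed
        {"params": no_decay, "weight_decay": 0.0},  # biases do not
    ],
    lr=0.01,
)
```

The same pattern also lets you give some bias terms a weaker (nonzero) weight_decay instead of turning it off entirely.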
You can also implement weight decay by adding an explicit sum-of-squares
(L2) penalty to your loss function, e.g.,
loss_weight_decay_penalty = alpha * (fc1.weight**2).sum()
This would apply weight decay of “strength” alpha to the weight Parameter of
the Linear layer fc1, but not to fc1’s bias Parameter.
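Putting that one-liner in context, a minimal sketch of a training step with the explicit penalty (the layer sizes, alpha, and dummy data are made up for illustration):

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(4, 3)  # hypothetical layer
alpha = 1e-4           # made-up penalty strength

inputs = torch.randn(8, 4)
targets = torch.randn(8, 3)

base_loss = nn.functional.mse_loss(fc1(inputs), targets)
# explicit sum-of-squares (L2) penalty on fc1.weight only --
# fc1.bias is left out, so it is not decayed
loss = base_loss + alpha * (fc1.weight ** 2).sum()
loss.backward()
```

Because the penalty only involves fc1.weight, only that Parameter picks up the extra decay gradient; fc1.bias still receives its ordinary gradient from the base loss.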
Best.
K. Frank
Thanks sincerely! I think the per-parameter options are a good way to realize this.
By the way, in my opinion, only the weight parameters of the Conv layers in a CNN need weight decay.