Hi, guys,

I heard that weight decay might not be appropriate to apply to all of the bias terms. Is that true? And what is a good way to apply weight decay to bias terms?

Your answer and guide will be appreciated!

Hi Songyuc!

I haven't experimented with this in any detail, but I have used weight decay on entire networks, including the bias terms, without any problems.

It is true, at least according to my intuition, that the bias terms have less redundancy in them than do, for example, the weight terms of fully-connected layers. So it could be that weight decay is less beneficial for bias terms (and maybe more likely to cause problems).

You could use the `weight_decay` feature of PyTorch's `Optimizer`s together with parameter groups to apply `weight_decay` to all of the `Parameter`s except bias terms, or to only some bias terms, or to apply a weaker `weight_decay` to some bias terms, and so on.
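As a rough sketch of the parameter-group approach (the toy model, the learning rate, and the `1e-4` decay value are made up for illustration, and selecting biases by the name suffix `"bias"` is just one possible convention):

```python
import torch
from torch import nn

# Hypothetical toy model, used only to have some weights and biases.
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 2),
)

# Split the parameters by name: anything whose name ends in "bias"
# goes into the group that gets no weight decay.
decay, no_decay = [], []
for name, param in model.named_parameters():
    if name.endswith("bias"):
        no_decay.append(param)
    else:
        decay.append(param)

# Per-parameter options: each dict is one parameter group, and a key
# set inside a group overrides the optimizer-wide default.
optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},   # illustrative value
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.1,
)
```

To apply a weaker decay to the biases instead of none, you would give the second group a small nonzero `weight_decay` rather than `0.0`.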

You can also implement weight decay by adding an explicit sum-of-squares (L2) penalty to your loss function, e.g.,

```
# L2 penalty on fc1's weight only; add it to your task loss before calling backward()
loss_weight_decay_penalty = alpha * (fc1.weight**2).sum()
```

This would apply weight decay of "strength" `alpha` to the `weight` `Parameter` of `Linear` layer `fc1`, but not to `fc1`'s `bias` `Parameter`.
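For context, here is a minimal sketch of how such a penalty might fit into one training step. The layer sizes, the random data, the MSE loss, and the value of `alpha` are all assumptions made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

fc1 = nn.Linear(4, 3)
alpha = 1e-2  # hypothetical penalty strength

# Made-up data, only so that the step can run.
x = torch.randn(8, 4)
target = torch.randn(8, 3)

# weight_decay is left at its default of 0 so that the only decay
# comes from the explicit penalty below.
optimizer = torch.optim.SGD(fc1.parameters(), lr=0.1)

task_loss = nn.functional.mse_loss(fc1(x), target)

# Sum-of-squares penalty on fc1.weight only; fc1.bias is not penalized.
loss_weight_decay_penalty = alpha * (fc1.weight ** 2).sum()
loss = task_loss + loss_weight_decay_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that the gradient of `alpha * (w**2).sum()` is `2 * alpha * w`, while plain SGD's `weight_decay` adds `weight_decay * w` to the gradient, so for plain SGD this penalty should correspond to `weight_decay = 2 * alpha`.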

Best.

K. Frank

Thanks sincerely! I think the *Per-parameter options* are a good way to do this.

By the way, in my opinion, only the weight parameters of the Conv layers in a CNN need weight decay.