Weight decay - Does it change layers with no gradients?

I optimize some parts of a model, while other parts are kept fixed with requires_grad_(False).
Does weight decay change the weights of the layers whose requires_grad is set to False, even if inputs are forwarded through these layers?
And what about a layer whose requires_grad is set to True, but through which no inputs are forwarded during this optimization step?
Thanks

I don’t think so.
When I look here, I notice that they just pass the parameters that require a gradient to F.adam, which in turn uses them for its computations.
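For example, a quick sketch with a made-up two-layer toy model where the first layer is frozen:

```python
import torch
import torch.nn as nn

# Toy model: the first layer is frozen, the second one is trained.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))
model[0].requires_grad_(False)

opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-2)

frozen_before = model[0].weight.detach().clone()

model(torch.randn(8, 4)).sum().backward()
opt.step()

# The frozen layer's .grad is still None, so it never reaches F.adam
# and the weight decay term is never applied to it.
print(model[0].weight.grad)                          # None
print(torch.equal(frozen_before, model[0].weight))   # True
```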

Yeah, as stated by @pascal_notsawo, when you pass model.parameters() into the optimizer, the tensors with requires_grad = False are automatically skipped during the update, so weight decay will not affect them; they are no longer part of the optimization process.
For the second case, if the inputs are not forwarded, then weight decay will still affect them; you can check this by plotting the weights on TensorBoard or any other visualization tool.
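One way to check it concretely without TensorBoard (a made-up used/unused pair of layers; snapshot the weights and compare after the step):

```python
import torch
import torch.nn as nn

# 'unused' requires gradients but is skipped in this forward pass.
used = nn.Linear(4, 1)
unused = nn.Linear(4, 4)

opt = torch.optim.Adam(
    list(used.parameters()) + list(unused.parameters()),
    lr=1e-2, weight_decay=1e-2,
)

before = unused.weight.detach().clone()

loss = used(torch.randn(8, 4)).sum()   # 'unused' is not part of the graph
loss.backward()
opt.step()

# Whether this prints True or False tells you if weight decay touched the
# unused layer in your setup (it depends on whether its .grad is None or a
# zero tensor at step time).
print(torch.equal(before, unused.weight))
```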

Thank you.
I guess that the second case does not occur when the optimization step is done after zero_grad(set_to_none=True), does it?

set_to_none=True is just a way to speed up the zeroing of your gradient attributes, so it shouldn’t change anything to do with weight decay.

It’s quicker to set all your param.grad attributes to None than to fill them with zeros via torch.zeros_like!
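Just to illustrate what the flag does to the .grad attributes (a throwaway sketch, nothing to do with weight decay itself):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 1)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

layer(torch.randn(2, 4)).sum().backward()
opt.zero_grad(set_to_none=False)
print(layer.weight.grad)        # a tensor of zeros

layer(torch.randn(2, 4)).sum().backward()
opt.zero_grad(set_to_none=True)
print(layer.weight.grad)        # None -- no tensor to fill, it is just dropped
```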

Thank you.
How does weight decay still affect the parameters of a layer through which no inputs are forwarded, as @shivammehta007 said?
It seems that the gradients of this layer remain None after the backward pass, so these parameters will not be appended to params_with_grad in the step function. Then it seems these parameters will not be iterated over in F.adam and weight decay will not be applied to them, as @pascal_notsawo said.
By the way, according to zero_grad, gradients that are already None are indeed not zeroed, so set_to_none indeed does not change the situation.
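In other words, the flow seems to be roughly this (a paraphrase of the logic, not the actual torch.optim source):

```python
# Simplified paraphrase of that flow (not the real torch.optim.Adam code,
# just the shape of the logic described above).
def step_sketch(param_groups):
    for group in param_groups:
        params_with_grad, grads = [], []
        for p in group["params"]:
            if p.grad is None:              # frozen / unused params land here
                continue                    # -> they never reach F.adam,
            params_with_grad.append(p)      #    so no weight decay is applied
            grads.append(p.grad)
        # F.adam(params_with_grad, grads, ..., weight_decay=group["weight_decay"])
        # inside F.adam: grad = grad + weight_decay * param, then the Adam update
```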

I’m not 100% sure about the phrasing of the statement, but I assume you mean ‘forwarding’ as in calling a particular Module/Layer within your model? If so, then a layer in your network that doesn’t take any input data and just applies arithmetic to intermediate results might not have a grad_fn (depending on the arithmetic at play). With no grad_fn, the gradients stay None and those parameters would be skipped by the optimizer’s step method (as you said).
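For instance, one made-up way a branch can end up without a grad_fn (here via torch.no_grad(), just for illustration):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)              # requires_grad=True as usual

with torch.no_grad():                # this branch is not recorded in the graph
    y = layer(torch.randn(2, 4))

print(y.grad_fn)                     # None: no grad_fn for this output
print(layer.weight.grad)             # still None -> step() would skip these params
```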
