At the moment I’m trying to implement a CVAE that I want to retrain after it has been trained on reference data. For the retraining, I want to freeze every part of the network except the weights in the first encoder layer that correspond to the conditions present in the new data.
What I’m doing is using the requires_grad flag: I set it to False for every layer except the first encoder layer before retraining. On top of that, I set the gradient to 0 in the trainer during training, like this:
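for name, p in self.model.named_parameters():
    if p.grad is not None and p.grad.dim() == 2:  # skip frozen parameters and biases
        # keep only the gradient of the trailing condition columns
        p.grad[:, :-self.model.conditions] = 0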
I’m using an Adam optimizer for this model, but sadly the results aren’t really good. Does this way of freezing the weights interfere with the optimizer? Or is there a better way to partially freeze a layer in PyTorch?
I would be glad for any help.
Your approach looks correct.
Note that Adam (and other optimizers with internal running estimates) can still change parameters whose current gradient is zero, because the update is computed from running averages of past gradients. This might cause the issue you are seeing.
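A minimal sketch of this effect, assuming a toy 2x3 parameter `w` in place of the real encoder weight (the sizes, the learning rate, and the hand-assigned gradients are made up for illustration). After one ordinary step Adam holds non-zero running estimates, so zeroing part of the gradient on the next step does not stop those entries from moving:

```python
import torch

# Toy parameter standing in for a layer weight (hypothetical shape).
w = torch.nn.Parameter(torch.ones(2, 3))
opt = torch.optim.Adam([w], lr=0.1)

# Step 1: every entry gets a non-zero gradient, so Adam builds
# running estimates (exp_avg, exp_avg_sq) for all entries.
w.grad = torch.ones_like(w)
opt.step()
after_first = w.detach().clone()

# Step 2: zero the gradient of the first two columns (the "frozen" part).
w.grad = torch.ones_like(w)
w.grad[:, :2] = 0
opt.step()

# The "frozen" columns still moved, driven by the running average
# of the earlier gradient.
frozen_moved = not torch.equal(w.detach()[:, :2], after_first[:, :2])
print("frozen columns changed:", frozen_moved)
```

Using weight decay in `torch.optim.Adam` has a similar effect, since the decay term is added to the gradient and so makes a zeroed entry non-zero again.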
Thank you for your answer. So is there a valid approach to make sure that those weights stay frozen when using Adam?
Would it be possible to do it like Keras does it with its kernels?
For example, by somehow using parallel layers, one for the normal input and one for the conditions, so that you can set requires_grad to False for the condition-part layer?
Unfortunately, I don’t know how Keras “kernels” work.
You could either create a new optimizer with a subset of the parameters or try to reset the running stats.
Alternatively, you could try to set the .grad attributes to None for all “frozen” parameters and check whether they would still be updated (I haven’t tested this approach in the latest release and it’s quite hacky, so please don’t rely on it).
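A rough sketch of the `.grad = None` idea, assuming two toy parameters named `frozen` and `trainable` (both hypothetical). In the `torch.optim.Adam` implementation, parameters whose `.grad` is `None` are skipped during `step()`, so the check can be written as below. Note that this only works for whole parameters, not for a slice of one tensor:

```python
import torch

# Two toy parameters; the names are made up for this sketch.
frozen = torch.nn.Parameter(torch.ones(3))
trainable = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.Adam([frozen, trainable], lr=0.1)

# First step: both parameters receive gradients, so Adam holds
# running estimates for both of them.
frozen.grad = torch.ones(3)
trainable.grad = torch.ones(3)
opt.step()
frozen_before = frozen.detach().clone()
trainable_before = trainable.detach().clone()

# Second step: "freeze" one parameter by setting its grad to None.
frozen.grad = None
trainable.grad = torch.ones(3)
opt.step()

frozen_unchanged = torch.equal(frozen.detach(), frozen_before)
trainable_changed = not torch.equal(trainable.detach(), trainable_before)
print(frozen_unchanged, trainable_changed)
```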
When you say “a new optimizer with a subset of the parameters”, is it possible to choose only a subset of the weights of one layer, or is it only possible to pass a whole layer to the optimizer?
It’s only possible to pass complete parameters (e.g. a layer’s whole weight tensor) to the optimizer, not a slice of one.
Hm, okay. For retraining I’m already doing this in the following way:
if params is None:
    # only hand the trainable parameters to the optimizer
    params = filter(lambda p: p.requires_grad, self.model.parameters())
self.optimizer = torch.optim.Adam(
    params, lr=lr, eps=eps, weight_decay=self.weight_decay
)
But then I still have the problem that the whole first layer is unfrozen, and not only the weights for the conditions…
If that part of the layer wasn’t updated before by this optimizer, the running estimates for those entries wouldn’t have been built up, so you should be able to zero out the gradients for this part of the layer.
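A small sketch of that combination, under a few assumptions: a stand-in `torch.nn.Linear` for the first encoder layer, made-up sizes, the conditions occupying the trailing input columns, a freshly created optimizer, and no weight decay (the default). Instead of editing `.grad` in the trainer, the mask is applied through a tensor hook, which is one possible way to do it:

```python
import torch

torch.manual_seed(0)
n_inputs, n_conditions, n_hidden = 4, 2, 3  # hypothetical sizes
layer = torch.nn.Linear(n_inputs + n_conditions, n_hidden)

# Mask that is 1 for the trailing condition columns and 0 elsewhere.
mask = torch.zeros_like(layer.weight)
mask[:, -n_conditions:] = 1.0
# The hook rewrites the gradient before it is stored in .grad.
layer.weight.register_hook(lambda grad: grad * mask)
layer.bias.requires_grad_(False)

# Fresh optimizer for retraining: its running estimates start at zero.
opt = torch.optim.Adam([layer.weight], lr=0.1)

before = layer.weight.detach().clone()
for _ in range(5):
    x = torch.randn(8, n_inputs + n_conditions)
    loss = layer(x).pow(2).mean()  # dummy loss for the sketch
    opt.zero_grad()
    loss.backward()
    opt.step()

after = layer.weight.detach()
frozen_same = torch.equal(after[:, :-n_conditions], before[:, :-n_conditions])
cond_changed = not torch.equal(after[:, -n_conditions:], before[:, -n_conditions:])
print(frozen_same, cond_changed)
```

With a non-zero `weight_decay` passed to `torch.optim.Adam`, the masked entries would receive a decay term and would no longer stay fixed.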
Hm okay interesting, so it should work…
I also thought about another idea: having two parallel layers for the input, one for the normal input data and one for the condition input. I could then add the two outputs together and use the sum as input for the next layers. That way I could set the requires_grad flag to False for the first parallel layer, which handles the normal input data, and leave it at True for the second parallel layer, which handles the conditions.
Does that make sense to you? Or do you see any problems there?
Thanks again for your help, and sorry that I’m asking so much, but I’m pretty new to PyTorch.
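The parallel-layer idea described above could be sketched roughly as follows. The class name `SplitInputLayer`, the method `freeze_data_path`, and all sizes are hypothetical. Summing two linear layers gives the same function as one linear layer applied to the concatenated input, so the model’s capacity should not change, while the data path becomes a separate parameter that can be frozen as a whole:

```python
import torch
from torch import nn


class SplitInputLayer(nn.Module):
    """First encoder layer split into a data path and a condition path."""

    def __init__(self, n_inputs, n_conditions, n_hidden):
        super().__init__()
        self.data_layer = nn.Linear(n_inputs, n_hidden)
        # One bias is enough, so the condition path has none.
        self.cond_layer = nn.Linear(n_conditions, n_hidden, bias=False)

    def forward(self, x, c):
        # Equivalent to one Linear applied to torch.cat([x, c], dim=1).
        return self.data_layer(x) + self.cond_layer(c)

    def freeze_data_path(self):
        for p in self.data_layer.parameters():
            p.requires_grad_(False)


torch.manual_seed(0)
layer = SplitInputLayer(n_inputs=4, n_conditions=2, n_hidden=3)
layer.freeze_data_path()

# Only the condition weights are handed to the optimizer.
opt = torch.optim.Adam([p for p in layer.parameters() if p.requires_grad], lr=0.1)

data_before = layer.data_layer.weight.detach().clone()
cond_before = layer.cond_layer.weight.detach().clone()

x, c = torch.randn(8, 4), torch.randn(8, 2)
loss = layer(x, c).pow(2).mean()  # dummy loss for the sketch
opt.zero_grad()
loss.backward()
opt.step()

data_unchanged = torch.equal(layer.data_layer.weight.detach(), data_before)
cond_updated = not torch.equal(layer.cond_layer.weight.detach(), cond_before)
print(data_unchanged, cond_updated)
```

Because the frozen path is never passed to the optimizer, Adam keeps no running estimates for it and weight decay cannot touch it either.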