Hello, I have a question about methods like detach() (and setting requires_grad to False) and their relation to add_param_group. Do these two functions serve exactly opposite purposes, i.e. does one undo the other?
For example, if I build a net that may freeze or unfreeze different layers before each backward pass during training, would it be suitable to use detach() (or to toggle requires_grad) to omit the frozen layers, and, conversely, to use add_param_group to add them back into the optimizer’s param groups?
Even though you can use these functions to achieve complementary results, the functions themselves are not complementary at all.
detach() on a tensor returns a new tensor with the same values, device, etc. The only difference is that this tensor is not connected to the previous computation graph at all (although it still shares memory with the original tensor).
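A minimal sketch of those two properties (detached from the graph, but sharing storage), assuming a toy graph built from a couple of tensor ops:

```python
import torch

# a small computation graph
x = torch.ones(3, requires_grad=True)
y = x * 2

# detach() returns a tensor with the same values that is cut off
# from the graph; it still shares memory with the original tensor
z = y.detach()

assert not z.requires_grad           # no longer tracked by autograd
assert z.data_ptr() == y.data_ptr()  # same underlying storage

# an in-place change to z is therefore visible through y as well
z[0] = 42.0
print(y[0].item())  # 42.0
```

Note that mutating a detached tensor in-place like this would make a later backward() through y fail, which is exactly why sharing memory with a detached view can be a footgun.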
Adding a new param group has nothing to do with the computation graph at all; it only defines which parameters the optimizer is allowed to update.
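To illustrate, here is a hedged sketch (the two-layer setup is just an example, not from the thread) of handing an extra set of parameters to an already-constructed optimizer:

```python
import torch
from torch import nn

# hypothetical model pieces; only the first is optimized at first
layer1 = nn.Linear(4, 4)
layer2 = nn.Linear(4, 2)

opt = torch.optim.SGD(layer1.parameters(), lr=0.1)
assert len(opt.param_groups) == 1

# later, register the second layer's parameters with the optimizer,
# optionally with its own hyperparameters (e.g. a smaller lr)
opt.add_param_group({"params": layer2.parameters(), "lr": 0.01})
assert len(opt.param_groups) == 2
```

Nothing about the autograd graph changed here; the optimizer simply knows about more parameters than before.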
@justusschock Thanks for the quick reply!
I see, so the underlying mechanics are different. So what if I do the following:
I freeze and unfreeze layers by toggling requires_grad depending on the particular graph topology before each forward-backward pass during the training loop. But when I add layers back into the graph, I would need to add their parameters back into the optimizer with add_param_group.
Would that work? Also, do I need to remove the parameter group from the optimizer when I freeze layers at some point during the training loop (the opposite of add_param_group)? Or is that already tracked and taken care of when I freeze them with requires_grad?
According to the source code, if the grad of a parameter is None, it will not be updated by optimizer.step().
So you could just set requires_grad=False on the layers you want to freeze.
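A quick sketch of that behavior, assuming a made-up two-layer net where only the first layer is frozen (SGD skips any parameter whose .grad is None):

```python
import torch
from torch import nn

torch.manual_seed(0)
frozen = nn.Linear(3, 3)
trainable = nn.Linear(3, 1)
opt = torch.optim.SGD(
    list(frozen.parameters()) + list(trainable.parameters()), lr=0.1
)

# freeze the first layer: autograd will not populate .grad for it
for p in frozen.parameters():
    p.requires_grad = False

w_frozen = frozen.weight.clone()
w_train = trainable.weight.clone()

loss = trainable(frozen(torch.randn(5, 3))).sum()
loss.backward()
opt.step()

assert frozen.weight.grad is None                  # never populated
assert torch.equal(frozen.weight, w_frozen)        # step() skipped it
assert not torch.equal(trainable.weight, w_train)  # this one moved
```

So the optimizer can keep holding the frozen parameters in its param groups; as long as their .grad stays None, step() leaves them alone. (One caveat worth knowing: optimizers with per-parameter state, like momentum or Adam, can still apply stale updates if a .grad from an earlier unfrozen pass is left lying around, so it is common to call opt.zero_grad(set_to_none=True) when freezing.)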
Ah I see, many thanks @MariosOreo!