Train only part of a variable in the network


(Dream Soul) #1

Is there a way to train only part of a variable? I know we can set requires_grad = False on a variable, but I want to train only part of it. Is there a way to do this?


(Simon Wang) #2

Separate your parameter into two tensors and cat them together.


(Dream Soul) #3

using torch.split()?


(Dream Soul) #4

For example:

param = list(model.parameters())
optimizer = torch.optim.SGD([param[0][:10]], lr=0.01, momentum=0.9)

or

optimizer = torch.optim.SGD([torch.split(param[0], 10, 0)[0]], lr=0.01, momentum=0.9)

Both return ValueError: can't optimize a non-leaf Variable.
I found out that after I slice or split the parameter, it becomes a non-leaf node.
What I want is to dynamically train a sub-network, like in [requires_grad=True/False dynamically].
I also want to split the parameter into even smaller parts. Is there a way to do that?
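A minimal reproduction of the non-leaf problem described above (the shapes here are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 8)
chunk = list(model.parameters())[0][:3]  # first 3 rows of the weight

# Slicing (or torch.split) produces a new autograd node, not a leaf:
print(chunk.is_leaf)  # False

# ...so the optimizer refuses it:
try:
    torch.optim.SGD([chunk], lr=0.01)
    raised = False
except ValueError:
    raised = True  # "can't optimize a non-leaf Tensor" (wording varies by version)
print(raised)  # True
```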


(Simon Wang) #5

I meant keeping them as separate tensors, and concatenating them together before use so they act as a single parameter.
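A minimal sketch of that idea, assuming an arbitrary 8x4 weight split into a trainable top half and a frozen bottom half:

```python
import torch

# Two separate leaf tensors standing in for one 8x4 weight.
w_train = torch.randn(5, 4, requires_grad=True)
w_frozen = torch.randn(3, 4)               # requires_grad=False by default

opt = torch.optim.SGD([w_train], lr=0.1)   # only the trainable half

x = torch.randn(2, 4)
w = torch.cat([w_train, w_frozen], dim=0)  # used as a single parameter
loss = (x @ w.t()).sum()
loss.backward()
opt.step()

print(w_train.grad is not None)  # True: gradient reaches the trainable half
print(w_frozen.grad is None)     # True: the frozen half gets no gradient
```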


(Dream Soul) #6

I understand what you are describing. I have to declare separate leaf-node parameters, concat them together, and then put them into a module. That way we'll have separate leaf-node parameters.
So it is not possible to modify the leaf-node parameters of an nn.Module?
Things like nn.Linear and nn.Conv2d, once created, cannot be separated and trained later?


(Simon Wang) #7

I’m not sure what you mean by separating parameters, but you can achieve it using register_forward_pre_hook, or just use the torch.nn.functional.* functional forms.
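A sketch of the register_forward_pre_hook route, patterned after what torch.nn.utils.weight_norm does internally; the split point and the names weight_top/weight_bot are made up for illustration:

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 8)

# Replace the single weight Parameter with two leaf halves
# (this mirrors what torch.nn.utils.weight_norm does internally).
w = lin.weight.data
del lin.weight
lin.register_parameter('weight_top', nn.Parameter(w[:3].clone()))
lin.register_parameter('weight_bot', nn.Parameter(w[3:].clone()))

def rebuild_weight(module, inputs):
    # Runs before every forward: cat the halves into one (non-leaf) weight.
    module.weight = torch.cat([module.weight_top, module.weight_bot], dim=0)

lin.register_forward_pre_hook(rebuild_weight)

out = lin(torch.randn(2, 4))
out.sum().backward()

# The leaf halves receive gradients and can be optimized independently.
opt = torch.optim.SGD([lin.weight_top], lr=0.01)
```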


(Dream Soul) #8

For example:

model = nn.Linear(in_features, out_features)

model.parameters() returns the weight and bias. If I want to split the weight into several chunks along dimension 0, I can use

list(model.parameters())[0][:chunk_size]

or

torch.split(list(model.parameters())[0], chunk_size, 0)[0]

But both of them return a non-leaf node, which cannot be the input of an optimizer.
One way to resolve this is to write my own Linear() module with the weight chunks kept as separate parameters in advance, cat them together into self.weight, and finally do

F.linear(input, self.weight, self.bias)

My question is whether I can just do some magic to the parameters of the nn.Linear() that PyTorch provides and achieve this goal. The reason is that it is tedious to rewrite every module (nn.Conv2d etc.) from scratch.
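A sketch of the custom-module route described above; SplitLinear and the split point are hypothetical names for illustration, not a PyTorch API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitLinear(nn.Module):
    """Linear layer whose weight rows live in two separate leaf
    parameters, so each chunk can be trained or frozen on its own."""
    def __init__(self, in_features, out_features, split):
        super().__init__()
        self.w_a = nn.Parameter(torch.randn(split, in_features) * 0.1)
        self.w_b = nn.Parameter(torch.randn(out_features - split, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        weight = torch.cat([self.w_a, self.w_b], dim=0)  # acts as one weight
        return F.linear(x, weight, self.bias)

m = SplitLinear(4, 8, split=3)
opt = torch.optim.SGD([m.w_a], lr=0.01)  # train only the first 3 rows
y = m(torch.randn(2, 4))
print(y.shape)  # torch.Size([2, 8])
```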


(Simon Wang) #9

No, do not split existing parameters. Define your own and cat them together in forward or in hooks.


(Simon Wang) #10

this might be a good example https://github.com/pytorch/pytorch/blob/master/torch/nn/utils/weight_norm.py#L21


(Dream Soul) #11

Thanks for your reference.
I'll dig into it.
Another potential problem is that I need to set requires_grad = True on all the split parameters, but occasionally I do not want to compute the grad for some of them.
For example,

params = [param0, param1]
optimizer = torch.optim.SGD([param0], lr=0.01, momentum=0.9)

I don't want to compute the grad for param1, but both of them have requires_grad = True, and I call

loss.backward()

This is related to [requires_grad=True/False dynamically].
However, in my case both param0 and param1 are involved in computing the loss function. Will the grad of param1 also be computed? How can I avoid that?
I realize there is a detach_() that may be useful, but there is no attach() method. [https://github.com/pytorch/pytorch/pull/6561]


(Simon Wang) #12

You can either change the flag before the forward pass, or use torch.autograd.grad to specify the inputs for which the grad should be computed.
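A sketch of the torch.autograd.grad route, using the parameter names from the earlier posts (the loss is an arbitrary example):

```python
import torch

param0 = torch.randn(3, requires_grad=True)
param1 = torch.randn(3, requires_grad=True)
x = torch.randn(3)

loss = ((x * param0).sum() + (x * param1).sum()) ** 2

# Ask for gradients w.r.t. param0 only; param1 is never touched.
(g0,) = torch.autograd.grad(loss, [param0])

# Unlike backward(), grad() returns the gradients instead of accumulating
# them into .grad, so assign them back before stepping the optimizer.
param0.grad = g0
opt = torch.optim.SGD([param0], lr=0.01)
opt.step()

print(param1.grad)  # None
```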


(Dream Soul) #13

Yeah, that's one way to do it. With only_inputs = True, it is possible to compute the grad of param0 only.
Should I also set retain_graph = True for the optimizer to step?
Is there any difference between using .grad() and .backward(), except that .grad() returns the gradients?


(Julio Hurtado) #14

I don’t know if there are any unwanted side effects of this, but maybe you can multiply the grad by zero in the parts of the variable that you don’t want to modify, just after the backward pass and before the optimizer step.
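A sketch of that grad-masking idea (the split at row 3 and the loss are arbitrary):

```python
import torch

w = torch.randn(8, 4, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
loss.backward()

# After backward, before step: zero the gradient rows we want to keep fixed.
w.grad[3:] = 0
frozen_before = w.detach()[3:].clone()
opt.step()

# Rows 3.. are unchanged; rows 0..2 were updated.
print(torch.equal(w.detach()[3:], frozen_before))  # True
```

Note that this only holds for plain SGD: with momentum or weight decay the optimizer can still move rows whose gradient is zero, which may be one of the side effects hinted at above.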


(Dream Soul) #15

I wonder whether there is a significant computation cost for the parameters I am not interested in.
I have not had time to dig into automatic differentiation yet.
For now, torch.autograd.grad is not working in this situation:
If I set only_inputs = True (the default), the gradient is not accumulated for any parameter and the loss does not decrease.
If I set only_inputs = False, the gradients of the parameters that I do not put into the optimizer are also computed and accumulated, because they are all leaf nodes and have requires_grad = True.
For now I assume that I have to wait for PyTorch to add attach() (or a similar method) to enable dynamically optimizing selected parameters.
[https://github.com/pytorch/pytorch/pull/6561]


(Dream Soul) #16

Solved from PyTorch 0.4.0, thanks to requires_grad_().
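A sketch of the in-place toggle with requires_grad_() that the last post refers to (parameter names are illustrative):

```python
import torch

param0 = torch.randn(3, requires_grad=True)
param1 = torch.randn(3, requires_grad=True)

# Dynamically freeze param1 for this step.
param1.requires_grad_(False)

loss = (param0 * 2).sum() + (param1 * 3).sum()
loss.backward()

print(param0.grad is not None)  # True
print(param1.grad is None)      # True: no grad computed while frozen

param1.requires_grad_(True)     # re-enable for later steps
```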