Train only part of a variable in the network

Is there a way to train only part of a variable? I know we can set requires_grad = False on a variable, but I want to train only part of the variable. Is there a way to do this?

Separate your parameter into two tensors and cat them together.

using torch.split()?

For example

param = list(model.parameters())
optimizer = torch.optim.SGD([param[0][:10]], lr = 0.01, momentum=0.9)

or

optimizer = torch.optim.SGD([list(torch.split(param[0], 10, 0))[0]], lr = 0.01, momentum=0.9)

It returns ValueError: can’t optimize a non-leaf Variable.
I found out that after I split the parameter, the result becomes a non-leaf node.
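A quick check confirms it (made-up sizes, just to illustrate):

import torch
import torch.nn as nn

w = nn.Linear(20, 30).weight                  # a leaf Parameter of shape (30, 20)
print(w.is_leaf)                              # True
print(w[:10].is_leaf)                         # False: slicing creates a non-leaf node in the graph
print(torch.split(w, 10, 0)[0].is_leaf)       # False as well
# torch.optim only accepts leaf tensors, hence the non-leaf Variable error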
What I want is to dynamically train a sub-network, like in [requires_grad=True/False dynamically].
I want to split the parameter into even smaller parts. Is there a way to do that?

I meant keeping them as separate tensors and concatenating them together right before they are used as a single parameter.

I understand what you are describing. I have to declare separate leaf-node parameters, concat them together, and then put them into a module. That way we’ll have separate leaf-node parameters.
So is it not possible to modify the leaf-node parameters of an nn.Module?
Once created, can things like nn.Linear and nn.Conv2d not have their parameters separated and trained later?

I’m not sure what you mean by separating parameters, but you can achieve it using register_forward_pre_hook, or just use the torch.nn.functional.* functional forms.

For example
model = nn.Linear(in_features, out_features)
model.parameters() yields the weight and bias. If I want to split the weight into several chunks along dimension 0, I can use
list(model.parameters())[0][:chunk_size]
or
list(torch.split(param[0], chunk_size, 0))[0]
But both of them return a non-leaf tensor, which cannot be passed to an optimizer.
One way to resolve this is to write my own Linear() module that keeps the weight chunks as separate parameters from the start, then cat them together into self.weight and finally do something like

F.linear(input, self.weight, self.bias)
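Something along these lines (a rough sketch; the class name, sizes, and split point are made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitLinear(nn.Module):
    def __init__(self, in_features, out_features, split_size):
        super().__init__()
        # two separate leaf Parameters instead of one big weight matrix
        self.weight_a = nn.Parameter(torch.randn(split_size, in_features))
        self.weight_b = nn.Parameter(torch.randn(out_features - split_size, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, input):
        # the concatenated weight is non-leaf and is rebuilt on every forward
        weight = torch.cat([self.weight_a, self.weight_b], dim=0)
        return F.linear(input, weight, self.bias)

model = SplitLinear(20, 30, split_size=10)
# train only the first 10 output rows
optimizer = torch.optim.SGD([model.weight_a], lr=0.01, momentum=0.9)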

My question is whether I can just do some magic on the parameters of the nn.Linear() that PyTorch provides to achieve this goal. The reason is that it is tedious to rewrite every module (nn.Conv2d, etc.) from scratch.

No, do not split existing parameters. Define your own ones and cat them together in forward or hooks.

this might be a good example https://github.com/pytorch/pytorch/blob/master/torch/nn/utils/weight_norm.py#L21
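Adapted to your splitting use case, the same trick might look roughly like this (just a sketch, not the actual weight_norm code; split_weight and the chunk names are invented):

import torch
import torch.nn as nn

def split_weight(module, split_size):
    weight = module.weight.data
    # drop the original Parameter so a plain tensor can be assigned to module.weight later
    del module._parameters['weight']
    module.register_parameter('weight_top', nn.Parameter(weight[:split_size].clone()))
    module.register_parameter('weight_bottom', nn.Parameter(weight[split_size:].clone()))

    def rebuild(module, input):
        # recompute the full (non-leaf) weight from the two leaf Parameters before every forward
        module.weight = torch.cat([module.weight_top, module.weight_bottom], dim=0)

    rebuild(module, None)                      # make sure module.weight exists before the first forward
    module.register_forward_pre_hook(rebuild)
    return module

linear = split_weight(nn.Linear(20, 30), split_size=10)
optimizer = torch.optim.SGD([linear.weight_top], lr=0.01, momentum=0.9)

The point is that weight_top and weight_bottom stay leaf Parameters you can give to the optimizer, while module.weight becomes an ordinary tensor that is rebuilt before every forward.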

Thanks for your reference.
I’ll dig into it.
Another potential problem is that I need to set requires_grad = True on all the split parameters, but occasionally I do not want to compute the grad for some of them.
For example,

param = [param0, param1]
optimizer = torch.optim.SGD([param0], lr = 0.01, momentum=0.9)

I don’t want to compute the grad for param1, but both of them have requires_grad = True.
If I do

loss.backward()

This is related to [requires_grad=True/False dynamically].
However, in my case both param0 and param1 are involved in computing the loss function. Will the grad of param1 also be computed? How can I avoid that?
I realize there is a detach_() that may be useful, but there is no attach() method. [https://github.com/pytorch/pytorch/pull/6561]

You can either change the flag before forward or just use torch.autograd.grad to specify the inputs for which grad should be computed
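For example (a sketch with dummy tensors, assuming the two-chunk setup discussed above):

import torch

param0 = torch.randn(10, 20, requires_grad=True)
param1 = torch.randn(10, 20, requires_grad=True)
optimizer = torch.optim.SGD([param0], lr=0.01, momentum=0.9)

x = torch.randn(5, 20)
weight = torch.cat([param0, param1], dim=0)   # non-leaf weight built from both chunks
loss = (x @ weight.t()).sum()                 # dummy loss

# loss.backward() would fill .grad for both chunks;
# torch.autograd.grad computes gradients only for the inputs you list
grad0, = torch.autograd.grad(loss, [param0])
param0.grad = grad0    # autograd.grad returns the gradients instead of accumulating them into .grad
optimizer.step()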

Yeah, that’s one way to do it. With only_inputs = True, it is possible to compute the grad for param0 only.
Should I also set retain_graph = True for the optimizer to step?
Is there any difference between using .grad() and .backward(), other than that .grad() returns the gradients?

I don’t know if there are any unwanted side effects of this, but maybe you can multiply the grad by zero in the parts of the variable that you don’t want to modify, right after the backward pass and before the optimizer step.
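Something like this (just a sketch; which rows you zero out is up to you):

import torch

weight = torch.randn(20, 30, requires_grad=True)
optimizer = torch.optim.SGD([weight], lr=0.01, momentum=0.9)

x = torch.randn(5, 30)
loss = (x @ weight.t()).sum()   # dummy loss
loss.backward()

weight.grad[10:] = 0   # zero the gradient of the rows you don't want to modify
optimizer.step()       # with a zero gradient (and no weight decay) those rows stay put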

I wonder whether there is a significant computation cost for the parameters I am not interested in.
I have not had time to dig into the automatic differentiation internals.
For now, torch.autograd.grad is not working in this situation.
If I set only_inputs = True (the default), the gradients are not accumulated into .grad for any parameter and the loss does not decrease.
If I set only_inputs = False, the gradients of the parameters that I do not put into the optimizer are also computed and accumulated, because they are all leaf nodes and have requires_grad = True.
For now I assume that I have to wait for PyTorch to add attach() (or some similar method) to enable dynamically optimizing selected features.
[https://github.com/pytorch/pytorch/pull/6561]

Solved since PyTorch 0.4.0, thanks to requires_grad_().
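For example (a sketch reusing the two-chunk idea from above):

import torch

param0 = torch.randn(10, 20, requires_grad=True)
param1 = torch.randn(10, 20, requires_grad=True)

param1.requires_grad_(False)                  # freeze the second chunk in place
weight = torch.cat([param0, param1], dim=0)   # backward will populate .grad for param0 only
# ... train for a while ...
param1.requires_grad_(True)                   # unfreeze it again when needed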

Hi @dem123456789, did you solve the problem of setting part of the weights in a layer to requires_grad = False?

No. As soon as you concatenate or stack two parameter tensors into one, their gradients will be calculated together.