Optimize partial parameters in nn.Parameter

Hello, I am trying to optimize only part of the parameters in a single nn.Parameter variable.

e.g. a simple CNN classifier

class Model(nn.Module):
    def __init__(self, device):
        super().__init__()
        self.resnet = ...
        feature_dim, n_cls = 64, 10
        # move the tensor to the device before wrapping it in nn.Parameter;
        # calling .to(device) on the Parameter would create a non-leaf copy
        self.classifier = torch.nn.Parameter(torch.Tensor(n_cls, feature_dim).to(device))

    def forward(self, input):
        out = self.resnet(input)
        return out.matmul(self.classifier.t())

my_model = Model(device)

In some cases, I only want to optimize certain rows of self.classifier,
e.g. [self.classifier[0], self.classifier[2], self.classifier[4], self.classifier[6], self.classifier[8]]

I figured out two possible ways to implement an optimizer for this situation, but both of them caused errors.

  1. directly collect the rows and pass them to the optimizer

params = []
for i, para in enumerate(my_model.classifier):
    if i % 2 == 0:
        params.append(para)

optimizer = Adam(params, ...)

error msg: TypeError: optimizer can only optimize Tensors, but one of the params is Module.parameters

  2. a similar way, but using a different parameter group for the optimizer

params = []
for i, para in enumerate(my_model.classifier):
    if i % 2 == 0:
        params.append({"params": para})

optimizer = Adam(params, ...)

error msg: ValueError: can’t optimize a non-leaf Tensor

Does anyone have an idea how to optimize only some entries of an nn.Parameter?

You could define separate parts of the self.classifier parameter and pass only the parts that should be optimized to the optimizer. In the forward method you would then recreate the “full” parameter via torch.cat and/or torch.stack and use it in the matmul.
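A minimal sketch of this first approach (class and attribute names are illustrative, not from the original post): the trainable rows live in one nn.Parameter, the fixed rows in a buffer, and forward interleaves them back into the full weight via torch.stack.

```python
import torch
import torch.nn as nn

class PartialClassifier(nn.Module):
    """Sketch: rows 0, 2, 4, ... are trainable; the odd rows stay fixed."""
    def __init__(self, n_cls=10, feature_dim=64):
        super().__init__()
        # the rows we want to optimize
        self.train_rows = nn.Parameter(torch.randn(n_cls // 2, feature_dim))
        # the fixed rows: a buffer, so model.parameters() never yields them
        self.register_buffer("frozen_rows", torch.randn(n_cls // 2, feature_dim))

    def forward(self, features):
        # recreate the full (n_cls, feature_dim) weight: stacking along a new
        # dim and reshaping interleaves the rows as (train, frozen, train, ...)
        w = torch.stack([self.train_rows, self.frozen_rows], dim=1)
        w = w.reshape(-1, self.train_rows.size(1))
        return features.matmul(w.t())

model = PartialClassifier()
# model.parameters() only contains train_rows, so the optimizer
# never touches the frozen rows, regardless of weight decay etc.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

frozen_before = model.frozen_rows.clone()
train_before = model.train_rows.detach().clone()
model(torch.randn(4, 64)).sum().backward()
optimizer.step()  # frozen_rows is unchanged, train_rows has moved
```

Registering the fixed rows as a buffer (rather than a second parameter with requires_grad=False) makes the intent explicit and keeps them out of model.parameters() entirely.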
Another approach would be to zero out the gradients of specific parts of the parameter, e.g. via register_hook, but this might be a bit tricky, especially if your optimizer uses internal running stats.
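The hook-based approach could be sketched as follows (illustrative names; plain SGD without momentum or weight decay is used on purpose, since for such an optimizer a zero gradient really means no update):

```python
import torch

classifier = torch.nn.Parameter(torch.randn(10, 64))

def mask_grad(grad):
    # zero the gradient of every odd-indexed row before the optimizer sees it
    grad = grad.clone()
    grad[1::2] = 0.0
    return grad

# the hook runs whenever a gradient w.r.t. classifier is computed
classifier.register_hook(mask_grad)

optimizer = torch.optim.SGD([classifier], lr=0.1)

before = classifier.detach().clone()
torch.randn(4, 64).matmul(classifier.t()).sum().backward()
optimizer.step()  # odd rows are untouched, even rows are updated
```

With momentum, running stats, or weight decay, the masked rows can still drift, which is exactly the caveat raised in the reply above.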


Many thanks for your informative reply!
I first thought about approach 2 (zeroing out the gradients). However, since I am not familiar with register_hook, I implemented it by manually zeroing out the gradients before optimizer.step(). In this case, if I initialize the optimizer with a non-zero weight_decay argument, the optimizer still updates the parameter. (ref here)

Therefore, I think approach 1 (separate parts of the parameter) might be better in practice: it explicitly tells the optimizer which parameters should be updated. (I think setting requires_grad also works, checked here)

Note: I am not sure whether register_hook can deal with the above-mentioned update caused by weight_decay.
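The weight-decay behavior is easy to verify directly: Adam folds weight_decay * param into the gradient before its update, so a parameter with an all-zero gradient still moves after step().

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
optimizer = torch.optim.Adam([p], lr=0.1, weight_decay=0.1)

p.grad = torch.zeros_like(p)  # the gradient is exactly zero
optimizer.step()

# p has changed anyway: Adam effectively used grad + weight_decay * p,
# which is non-zero here. No gradient hook can prevent this, because
# the decay term is added inside the optimizer, after the hook runs.
```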

No, there wouldn’t be a difference between zeroing out the gradients by accessing the parameter directly and using hooks. Given the mentioned shortcomings (or maybe unexpected behavior), I would also prefer the first approach.

I have implemented the first approach and it works well.
Many thanks for your kind help!