Different learning rates for different types of modules

How can I make all PReLU layers use a learning rate that is 0.1 times the one used by the other layers?


Off the top of my head, there are two options:

  1. write your own lr scheduler (see examples here: https://github.com/pytorch/pytorch/blob/master/torch/optim/lr_scheduler.py)
  2. use different optimizers for different parts of your network.

You can check out the optim per-parameter options, where there is a small example of how to set different learning rates for your layers.

Optimizers also support specifying per-parameter options. To do this, instead of passing an iterable of Variables, pass in an iterable of dicts. Each of them will define a separate parameter group, and should contain a params key, containing a list of parameters belonging to it. Other keys should match the keyword arguments accepted by the optimizers, and will be used as optimization options for this group.

optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
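Applied to the question above, a minimal sketch (assuming a hypothetical model that uses nn.PReLU in several places) could collect every PReLU parameter into one group that gets 0.1 times the base learning rate:

import torch
import torch.nn as nn

# Hypothetical model just for illustration; replace with your own.
model = nn.Sequential(nn.Linear(10, 10), nn.PReLU(), nn.Linear(10, 1), nn.PReLU())

base_lr = 1e-2

# Collect the parameters of every PReLU module, wherever it appears in the model.
prelu_params = [p for m in model.modules() if isinstance(m, nn.PReLU)
                for p in m.parameters()]
prelu_ids = {id(p) for p in prelu_params}
other_params = [p for p in model.parameters() if id(p) not in prelu_ids]

optimizer = torch.optim.SGD([
    {'params': other_params},                       # uses the default lr below
    {'params': prelu_params, 'lr': 0.1 * base_lr},  # PReLU parameters at 0.1x
], lr=base_lr, momentum=0.9)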


The problem is that my customized module is used in many places in the model… They are spread out, not together, but they have the same type, like using PReLU in different places of the model…

I just remembered this option and came here to comment. Thanks for pointing that out before me @ptrbick!

Oh, so you want its parameter to receive 0.1 of the original gradient no matter where it is used? How about registering a backward hook on that module’s parameter? You can even do that in the constructor.

Could you please give a piece of example code or a link?

import torch
import torch.nn as nn

class DDReLU(nn.Module):
    def __init__(self):
        super(DDReLU, self).__init__()
        self.threshold = nn.Parameter(torch.rand(1), requires_grad=True)
        # Scale this parameter's gradient by 0.1 whenever it is computed.
        self.threshold.register_hook(lambda grad: grad * 0.1)
        self.ReLU = nn.ReLU(True)

    def forward(self, x):
        return self.ReLU(x) + self.threshold
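The same idea also works for the built-in nn.PReLU modules without defining a custom class. A minimal sketch, assuming a hypothetical model that contains nn.PReLU layers:

import torch.nn as nn

# Hypothetical model for illustration; replace with your own.
model = nn.Sequential(nn.Linear(10, 10), nn.PReLU(), nn.Linear(10, 1), nn.PReLU())

# Register a gradient hook on every PReLU parameter so its gradient is scaled by 0.1.
for m in model.modules():
    if isinstance(m, nn.PReLU):
        for p in m.parameters():
            p.register_hook(lambda grad: grad * 0.1)

Note that scaling the gradient only behaves like a smaller learning rate for plain SGD; adaptive optimizers such as Adam normalize the gradient, as discussed further down the thread.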

http://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html?highlight=hook#forward-and-backward-function-hooks

http://pytorch.org/docs/master/autograd.html

I encountered an issue when implementing a dynamic learning rate. I want to give each tensor a different lr before each backward pass (i.e., before optim.step).

After skimming the source code of torch.optim, torch.optim.adam, and torch.optim.sgd, I realized that this is impossible as lr is passed as a fixed value when defining the optimizer.

I think what I need is a learning rate hook to modify the lr of each tensor before optim.step, somewhat like a module’s backward hook, through which we can modify the gradients.

You can use different learning rates for each parameter via the per-parameter options. However, if you want to manipulate it in every step, you might either need to recreate the optimizer (in which case stateful optimizers would re-initialize their running stats) or you could indeed use backward hooks to manipulate the .grad attribute directly.
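As a minimal sketch of the second approach (assuming hypothetical per-parameter scale factors that your own logic fills in each iteration), the gradients can be rescaled right before the step:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical model and per-parameter scale factors, just for illustration.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
grad_scales = {name: 1.0 for name, _ in model.named_parameters()}  # filled in by your own logic

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = F.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()

# Rescale each parameter's gradient before the update; for plain SGD this
# acts like giving each tensor its own effective learning rate.
for name, p in model.named_parameters():
    if p.grad is not None:
        p.grad.mul_(grad_scales[name])

optimizer.step()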

Thanks, recreating the optimizer might be a solution but could be cumbersome and inefficient.

I considered manipulating the .grad attribute directly, but for optimizers with moving averages like Adam, I found it hard to achieve the same effect.
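For reference, a rough sketch of why a uniform gradient scale largely cancels under Adam (standard Adam update, bias correction omitted for brevity):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \qquad
\theta_{t+1} = \theta_t - \eta \, \frac{m_t}{\sqrt{v_t} + \epsilon}

Replacing g_t by c\,g_t scales m_t by c and \sqrt{v_t} by roughly c, so the step size barely changes, whereas scaling the learning rate \eta by c scales the step directly.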