Different learning rates for different types of modules

How can I make all PReLU layers use a learning rate that is 0.1 times the one used by the other layers?


Off the top of my head, there are two options:

  1. write your own lr scheduler (see examples here: https://github.com/pytorch/pytorch/blob/master/torch/optim/lr_scheduler.py)
  2. use different optimizers for different parts of your network.

You can check out the optim per-parameter options, where there is a small example of how to set different learning rates for your layers.

Optimizers also support specifying per-parameter options. To do this, instead of passing an iterable of Variables, pass in an iterable of dicts. Each of them will define a separate parameter group, and should contain a params key, containing a list of parameters belonging to it. Other keys should match the keyword arguments accepted by the optimizers, and will be used as optimization options for this group.

optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
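Applied to the question above, a minimal sketch (assuming a hypothetical model that uses nn.PReLU in several places) could collect every PReLU parameter into one group that gets 0.1 times the base learning rate:

import torch
import torch.nn as nn

# Hypothetical model just for illustration; replace with your own.
model = nn.Sequential(nn.Linear(10, 10), nn.PReLU(), nn.Linear(10, 1), nn.PReLU())

base_lr = 1e-2

# Collect the parameters of every PReLU module, wherever it appears in the model.
prelu_params = [p for m in model.modules() if isinstance(m, nn.PReLU)
                for p in m.parameters()]
prelu_ids = {id(p) for p in prelu_params}
other_params = [p for p in model.parameters() if id(p) not in prelu_ids]

optimizer = torch.optim.SGD([
    {'params': other_params},                       # uses the default lr below
    {'params': prelu_params, 'lr': 0.1 * base_lr},  # PReLU parameters at 0.1x
], lr=base_lr, momentum=0.9)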


The problem is that my customized module is used in many places in the model… They are spread out, not together, but they have the same type, like using PReLU in different places of the model…

I just remembered this option and came here to comment. Thanks for pointing that out before me @ptrbick!

Oh, so you want its parameter to receive 0.1 of the original gradient no matter where it is used? How about registering a backward hook on that module’s parameter? You can even do that in the constructor.

Could you please give a piece of example code or a link?

import torch
import torch.nn as nn

class DDReLU(nn.Module):
    def __init__(self):
        super(DDReLU, self).__init__()
        self.threshold = nn.Parameter(torch.rand(1), requires_grad=True)
        # Scale this parameter's gradient by 0.1 whenever it is computed.
        self.threshold.register_hook(lambda grad: grad * 0.1)
        self.ReLU = nn.ReLU(True)

    def forward(self, x):
        return self.ReLU(x) + self.threshold
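The same idea also works for the built-in nn.PReLU modules without defining a custom class. A minimal sketch, assuming a hypothetical model that contains nn.PReLU layers:

import torch.nn as nn

# Hypothetical model for illustration; replace with your own.
model = nn.Sequential(nn.Linear(10, 10), nn.PReLU(), nn.Linear(10, 1), nn.PReLU())

# Register a gradient hook on every PReLU parameter so its gradient is scaled by 0.1.
for m in model.modules():
    if isinstance(m, nn.PReLU):
        for p in m.parameters():
            p.register_hook(lambda grad: grad * 0.1)

Note that scaling the gradient only behaves like a smaller learning rate for plain SGD; adaptive optimizers such as Adam normalize the gradient, as discussed further down the thread.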

http://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html?highlight=hook#forward-and-backward-function-hooks

http://pytorch.org/docs/master/autograd.html

I encountered an issue when implementing a dynamic learning rate. I want to give each tensor a different lr before each backward pass (i.e., before optim.step).

After skimming the source code of torch.optim, torch.optim.adam, and torch.optim.sgd, I realized that this is impossible as lr is passed as a fixed value when defining the optimizer.

I think what I need is a learning rate hook to modify the lr of each tensor before optim.step, somewhat like a module’s backward hook, through which we can modify the gradients.

You can use different learning rates for each parameter via the per-parameter options. However, if you want to manipulate it in every step, you might either need to recreate the optimizer (in which case stateful optimizers would re-initialize their running stats) or you could indeed use backward hooks to manipulate the .grad attribute directly.
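As a minimal sketch of the second approach (assuming hypothetical per-parameter scale factors that your own logic fills in each iteration), the gradients can be rescaled right before the step:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical model and per-parameter scale factors, just for illustration.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
grad_scales = {name: 1.0 for name, _ in model.named_parameters()}  # filled in by your own logic

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = F.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()

# Rescale each parameter's gradient before the update; for plain SGD this
# acts like giving each tensor its own effective learning rate.
for name, p in model.named_parameters():
    if p.grad is not None:
        p.grad.mul_(grad_scales[name])

optimizer.step()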

Thanks, recreating the optimizer might be a solution but could be cumbersome and inefficient.

I considered manipulating the .grad attribute directly, but for optimizers with moving averages like Adam, I found it hard to achieve the same effect.
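For reference, a rough sketch of why a uniform gradient scale largely cancels under Adam (standard Adam update, bias correction omitted for brevity):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \qquad
\theta_{t+1} = \theta_t - \eta \, \frac{m_t}{\sqrt{v_t} + \epsilon}

Replacing g_t by c\,g_t scales m_t by c and \sqrt{v_t} by roughly c, so the step size barely changes, whereas scaling the learning rate \eta by c scales the step directly.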