I am a little confused about how to make the learning rate learnable for each parameter in PyTorch.
I want to set a separate learnable learning rate for each parameter, so that the learning rates and the model parameters are optimized together during training.
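To make the goal concrete, here is a toy sketch (a single tensor rather than my real model) of the behavior I am after:

import torch

w = torch.randn(3, requires_grad=True)               # a "model parameter"
task_lr = torch.full_like(w, 1e-4).requires_grad_()  # its learnable learning rate

inner_loss = (w ** 2).sum()
grad_w, = torch.autograd.grad(inner_loss, w, create_graph=True)
w_updated = w - task_lr * grad_w                     # update stays differentiable

outer_loss = (w_updated ** 2).sum()
outer_loss.backward()
print(task_lr.grad)                                  # non-None: task_lr receives a gradient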
I found some code:
optimizer = optim.Adam([
    {'params': net.layer1.weight},
    {'params': net.layer1.bias, 'lr': 0.01},
    {'params': net.layer2.weight, 'lr': 0.001}
], lr=0.1, weight_decay=0.0001)
This approach lets me set a different fixed learning rate per parameter group, but it does not let the learning rates themselves be trained along with the model.
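As I understand it, this is because each group's lr is stored as a plain Python float inside the optimizer, so there is nothing for autograd to differentiate:

# the group learning rates are ordinary floats, not tensors
for group in optimizer.param_groups:
    print(type(group['lr']))   # <class 'float'>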
I tried to define a class with a learnable learning rate per parameter:
from collections import OrderedDict

import torch
import torch.nn as nn

class Learner(nn.Module):
    def __init__(self, net, in_channels, num_classes):
        super(Learner, self).__init__()
        self.learner = net(in_channels, num_classes)
        # one learnable learning-rate tensor per parameter, filled in by define_lr()
        self.task_lr = OrderedDict()

    def forward(self, X):
        out = self.learner(X)
        return out

    def define_lr(self):
        # a learning-rate tensor with the same shape as each parameter
        for key, val in self.named_parameters():
            self.task_lr[key] = nn.Parameter(
                1e-4 * torch.ones_like(val, requires_grad=True))
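For reference, I construct and initialize it like this (ConvNet here is a stand-in for my actual network constructor, which takes in_channels and num_classes):

model = Learner(ConvNet, in_channels=3, num_classes=10)
model.define_lr()   # task_lr now holds one tensor per model parameter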
and then:
model_params = list(model.parameters()) + list(model.task_lr.values())
optimizer = optim.Adam(model_params, lr=args.lr)

# cloned_state_dict() is a custom helper (not shown) that returns a copy
# of the current parameters
adapted_state_dict = model.cloned_state_dict()

#####compute loss#####
#......
######################

model.zero_grad()
loss.backward()
# manual SGD-style update using the per-parameter learning rates
for name, param in model.named_parameters():
    if param.grad is not None:
        adapted_state_dict[name] -= model.task_lr[name] * param.grad.data
model.load_state_dict(adapted_state_dict)
optimizer.step()
But this method still does not train the learning rates: as far as I can tell, the tensors in task_lr never receive any gradient.
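For what it is worth, this is how I check it (a minimal sketch using the same names as above):

for name, lr in model.task_lr.items():
    print(name, lr.grad)   # always None, even after loss.backward()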
Feel free to ask if more code is needed to explain the problem.