Different learning rate for a specific layer

I want to change the learning rate of only one layer of my neural network to a smaller value. I am aware that one can set a per-layer learning rate according to this:
https://pytorch.org/docs/0.3.0/optim.html#per-parameter-options

However, if I have a lot of layers, it is quite tedious to specify a learning rate for each of them. Is there a more convenient way to specify one lr for just a specific layer and another lr for all other layers? Many thanks!

Yes, as you can see in the example of the docs you’ve linked, model.base.parameters() will use the default learning rate, while the learning rate is explicitly specified for model.classifier.parameters().

In your use case, you could filter out the specific layer and use the same approach.

Thanks. So is there a convenient way of filtering out a specific layer? I have been searching for a while and could not find one.

You could use filter to get the base and “special” parameters:

import torch.nn as nn
import torch.nn.functional as F


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 3, 1, 1)
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 12, 3, 1, 1)
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(12*56*56, 128)
        self.fc2 = nn.Linear(128, 10)
        
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

        
model = MyModel()

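# Full names of the parameters that should get their own learning rate
my_list = ['fc1.weight', 'fc1.bias']
# Both results are lists of (name, parameter) tuples, not bare tensors
params = list(filter(lambda kv: kv[0] in my_list, model.named_parameters()))
base_params = list(filter(lambda kv: kv[0] not in my_list, model.named_parameters()))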

Depending on your model definition (custom vs. nn.Sequential etc.), some other code snippets might be prettier. Let me know if that works for you.
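To turn the filtered lists into optimizer parameter groups, something like the sketch below should work. Since `named_parameters()` yields `(name, parameter)` tuples, only the tensors are passed on to the optimizer. The small `nn.Sequential` model, the names in `special_names`, and the learning-rate values are placeholders chosen for illustration, not taken from the posts above.

```python
import torch
import torch.nn as nn

# Placeholder model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Parameter names as reported by model.named_parameters().
special_names = ['2.weight', '2.bias']

# Keep only the tensors (second tuple element) for the optimizer.
special_params = [p for n, p in model.named_parameters() if n in special_names]
base_params = [p for n, p in model.named_parameters() if n not in special_names]

optimizer = torch.optim.SGD(
    [
        {'params': base_params},                 # uses the default lr below
        {'params': special_params, 'lr': 1e-4},  # overrides the default lr
    ],
    lr=1e-2,
    momentum=0.9,
)

# One line per group: number of tensors and the lr assigned to them.
for group in optimizer.param_groups:
    print(len(group['params']), group['lr'])
```

Options that a group does not set itself (here `momentum`, and `lr` for the first group) fall back to the keyword arguments given to the optimizer.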

Thanks a lot! This works for me.

Hello,
please consider the following code:

There are multiple layers before this affine operation and activation layer

    self.fc1 = nn.Linear(13824, 4096) 
    torch.nn.init.kaiming_normal_(self.fc1.weight) 
    self.batchnorm_fc1 = nn.BatchNorm1d(4096)
    #YL
    self.fc3 = nn.Linear(4096,  3* 11* 13)
    torch.nn.init.kaiming_normal_(self.fc3.weight)
    self.batchnorm_fc3 = nn.BatchNorm1d( 3* 11* 13)
    
    #YH[0]
    self.fc4 = nn.Linear(4096,   3* 3* 29* 39)
    torch.nn.init.kaiming_normal_(self.fc4.weight)
    self.batchnorm_fc4 = nn.BatchNorm1d( 3* 3* 29* 39)
    
    #YH[1] there are two more layers here 
    #YH[2] 

I start creating multiple heads here as the output. However, YH[0], for instance, has an output size (the product of its dimensions) of around 10k, while YL has an output size of around ~400. I would like to treat each of those layers (and their weights) differently when it comes to their learning rates. Otherwise the gradient propagates through all of those 400+10000+… numbers, which overlooks the significance of the YL sub-band output, the most significant one in my case. What do you recommend?

with the forward propagation:

    x = F.leaky_relu( self.batchnorm_fc1(self.fc1(x)) )
    ###############

           
    xl = self.fc3(x)
    xl  = xl .view(-1, 3, 11, 13)
    
    xh=[]
    xh.append(self.fc4(x))
    xh[0] = xh[0].view(-1,3, 3, 29, 39)
    ##same done with the xh[1] and xh[2]

      
    # return the output
    return xl,xh

Could you guide me on how to handle the weights of the xl and xh layers differently?

I am using those 4 output heads to construct a 2D inverse wavelet transform, but that isn’t so important right now.
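One way to treat the heads differently is to give each head its own parameter group. The sketch below assumes a reduced two-head version of the model above. The class name `TwoHeadNet`, the small layer sizes, and the learning-rate values are placeholders, and the batch-norm layers are left out for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadNet(nn.Module):
    """Hypothetical, shrunken stand-in for the multi-head model in the post."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32, 16)  # shared trunk
        self.fc3 = nn.Linear(16, 6)   # YL head (small output)
        self.fc4 = nn.Linear(16, 24)  # YH[0] head (large output)

    def forward(self, x):
        x = F.leaky_relu(self.fc1(x))
        return self.fc3(x), self.fc4(x)


model = TwoHeadNet()

optimizer = torch.optim.Adam(
    [
        {'params': model.fc1.parameters()},              # default lr
        {'params': model.fc3.parameters(), 'lr': 1e-3},  # YL head
        {'params': model.fc4.parameters(), 'lr': 1e-5},  # YH[0] head
    ],
    lr=1e-4,
)

head_lrs = [group['lr'] for group in optimizer.param_groups]
print(head_lrs)
```

The shared trunk still receives gradients from every head, so per-group learning rates only change how fast each head's own weights move.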

This stuff all worked great - thanks for the post!

I will point out something: if you want to use this with a scheduler that takes its own learning-rate bounds (e.g. CyclicLR), then the lr’s of the parameter groups should also be passed to the scheduler constructor as a list of floats, one per group. Otherwise, the group-dependent lr’s are lost.

Edit: note also that, for example, base_params will not be a list of torch tensors (which it needs to be). It will be a list of (name, parameter) tuples, and one must use only the second component of each tuple.
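A minimal sketch of the CyclicLR case: `base_lr` and `max_lr` accept either a single float or a list with one entry per parameter group. The model and all numeric values here are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

optimizer = torch.optim.SGD(
    [
        {'params': model[0].parameters()},
        {'params': model[2].parameters(), 'lr': 1e-4},
    ],
    lr=1e-2,
    momentum=0.9,
)

# One base_lr / max_lr entry per parameter group, in group order.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=[1e-3, 1e-5],
    max_lr=[1e-2, 1e-4],
    step_size_up=100,
    cycle_momentum=False,
)

# The scheduler overwrites each group's lr with its own base_lr.
lrs = [group['lr'] for group in optimizer.param_groups]
print(lrs)
```

With a single float for `base_lr`, every group would start from that same value, which is how the group-dependent lr’s get lost.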

Is this also required for LambdaLR and such? If so, can you please show how to implement it?
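As far as I can tell, LambdaLR does not need this: it multiplies each group's initial lr by the value of the lambda, so the ratio between the groups is kept. The following is a sketch with a placeholder model, a placeholder decay function, and placeholder lr values.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

optimizer = torch.optim.SGD(
    [
        {'params': model[0].parameters()},
        {'params': model[2].parameters(), 'lr': 1e-4},
    ],
    lr=1e-2,
)

# A single function is applied to every group as a multiplicative factor.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.5 ** epoch
)

# One dummy training step followed by one scheduler step.
loss = model(torch.randn(2, 8)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()

lrs_after_one_step = [group['lr'] for group in optimizer.param_groups]
print(lrs_after_one_step)
```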

Hello, I get an error when I resume the optimizer.
I define the optimizer like this:

     # Reduce the learning rate of the shared layers; this is important
     all_parameters = set(model.parameters())
     nas_layers_params = []
     for m in model.modules():
         if isinstance(m, BlockSwitch):
             nas_layers_params += list(m.parameters())
     nas_layers_params = set(nas_layers_params)
     comm_layers_params = all_parameters - nas_layers_params
     nas_layers_params = list(nas_layers_params)
     comm_layers_params = list(comm_layers_params)

     optimizer = torch.optim.Adam(
         [{"params": nas_layers_params},
          {"params": comm_layers_params, "lr": args.learning_rate/model.num_blocks_per_layer}  # the shared layers' learning rate should be averaged
         ],
         args.learning_rate,
         #momentum=args.momentum,
         weight_decay=args.weight_decay)
     scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
         optimizer, float(args.epochs), eta_min=args.learning_rate_min, last_epoch=-1)

I save the model like this:

         # save the states of this epoch
         state = {
             'epoch': epoch,
             'args': args,
             'optimizer_state': optimizer.state_dict(),
             'supernet_state': model.state_dict(),
             'scheduler_state': scheduler.state_dict()
         }
         path = './super_train/{}/super_train_states.pt.tar'.format(args.exp_name)
         torch.save(state, path)

And load the optimizer like this:

     if args.resume:
         resume_path = './super_train/{}/super_train_states.pt.tar'.format(args.exp_name)
         if os.path.isfile(resume_path):
             print("Loading checkpoint '{}'".format(resume_path))
             checkpoint = torch.load(resume_path)

             start_epoch = checkpoint['epoch']
             model.load_state_dict(checkpoint['supernet_state'])
             optimizer.load_state_dict(checkpoint['optimizer_state'])
             scheduler.load_state_dict(checkpoint['scheduler_state'])
         else:
             raise ValueError("No checkpoint found at '{}'".format(resume_path))

But I get this error:

  File "train.py", line 197, in main
    train(args, epoch, train_data, device, model, criterion=criterion, optimizer=optimizer, my_choice=choice)
  File "train.py", line 77, in train
    optimizer.step()
  File "/data/limingyao/miniconda3/envs/py38/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 66, in wrapper
    return wrapped(*args, **kwargs)
  File "/data/limingyao/miniconda3/envs/py38/lib/python3.8/site-packages/torch/optim/adam.py", line 95, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: The size of tensor a (80) must match the size of tensor b (240) at non-singleton dimension 0

Is the way I load optimizer.state_dict wrong?
Thank you
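A possible cause, which the traceback alone does not confirm: the parameter groups are built from Python sets. Tensors hash by object identity, so the iteration order of a set of parameters can differ from one run to the next, while `optimizer.load_state_dict` matches saved state to parameters by their position within each group. The sketch below builds the two groups in the deterministic order of `model.parameters()` instead. `BlockSwitch` here is a hypothetical stand-in for the module in the post, and `split_params` and `build_model` are illustrative helpers.

```python
import torch
import torch.nn as nn


class BlockSwitch(nn.Module):
    """Hypothetical stand-in for the BlockSwitch module in the post."""

    def __init__(self):
        super().__init__()
        self.ops = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 8)])


def split_params(model, special_type):
    """Return (special, common) lists, both in model.parameters() order."""
    special_ids = {
        id(p)
        for m in model.modules()
        if isinstance(m, special_type)
        for p in m.parameters()
    }
    special = [p for p in model.parameters() if id(p) in special_ids]
    common = [p for p in model.parameters() if id(p) not in special_ids]
    return special, common


def build_model():
    # Only the parameter layout matters here; the model is never called.
    return nn.Sequential(nn.Linear(4, 4), BlockSwitch(), nn.Linear(8, 2))


model = build_model()
nas_params, comm_params = split_params(model, BlockSwitch)

optimizer = torch.optim.Adam(
    [{'params': nas_params}, {'params': comm_params, 'lr': 1e-4}],
    lr=1e-3,
)
```

Because the order now depends only on the model definition, a fresh process that rebuilds the same model gets the same group layout as the one that wrote the checkpoint.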

@ptrblck’s solution requires you to list the full names of the parameters whose lr you want to change.
I wrote a recursive solution which lets you use just the submodule path.

The code:

from pprint import pprint
from typing import Dict
from torchvision import models


def group_wise_lr(model, group_lr_conf: Dict, path=""):
    """
    Refer https://pytorch.org/docs/master/optim.html#per-parameter-options


    torch.optim.SGD([
        {'params': model.base.parameters()},
        {'params': model.classifier.parameters(), 'lr': 1e-3}
    ], lr=1e-2, momentum=0.9)


    to


    cfg = {"classifier": {"lr": 1e-3},
           "lr": 1e-2, "momentum": 0.9}
    confs, names = group_wise_lr(model, cfg)
    torch.optim.SGD(confs, lr=1e-2, momentum=0.9)



    :param model:
    :param group_lr_conf:
    :return:
    """
    assert type(group_lr_conf) == dict
    confs = []
    nms = []
    for kl, vl in group_lr_conf.items():
        assert type(kl) == str
        assert type(vl) == dict or type(vl) == float or type(vl) == int

        if type(vl) == dict:
            assert hasattr(model, kl)
            cfs, names = group_wise_lr(getattr(model, kl), vl, path=path + kl + ".")
            confs.extend(cfs)
            names = list(map(lambda n: kl + "." + n, names))
            nms.extend(names)

    primitives = {kk: vk for kk, vk in group_lr_conf.items() if type(vk) == float or type(vk) == int}
    remaining_params = [(k, p) for k, p in model.named_parameters() if k not in nms]
    if len(remaining_params) > 0:
        names, params = zip(*remaining_params)
        conf = dict(params=params, **primitives)
        confs.append(conf)
        nms.extend(names)

    plen = sum([len(list(c["params"])) for c in confs])
    assert len(list(model.parameters())) == plen
    assert set(list(zip(*model.named_parameters()))[0]) == set(nms)
    assert plen == len(nms)
    if path == "":
        for c in confs:
            c["params"] = (n for n in c["params"])
    return confs, nms


if __name__ == "__main__":
    model = models.resnet18(pretrained=True)

    test_configs = [
        # Give same Lr to all model params
        {"lr": 0.3},

        # For the 3 cases below, you will need to pass overall (default) optimiser params to the optimiser for the remaining model params.
        # This is because we did not specify optimiser params for all top-level submodules, so defaults need to be supplied
        # Refer https://pytorch.org/docs/master/optim.html#per-parameter-options

        # Give same Lr to layer4 only
        {"layer4": {"lr": 0.3}},

        # Give one LR to layer4 and another to rest of model. We can do this recursively too.
        {"layer4": {"lr": 0.3},
         "lr": 0.5},

        # Give one LR to layer4.0 and another to rest of layer4
        {"layer4": {"0": {"lr": 0.001},
                    "lr": 0.3}},

        # More examples
        {"layer4": {"lr": 0.3,
                    "0": {"lr": 0.001}}},

        {"layer3": {"0": {"conv2": {"lr": 0.001}},
                    "1": {"lr": 0.003}}},

        {"layer4": {"lr": 0.3},
         "layer3": {"0": {"conv2": {"lr": 0.001}},
                    "lr": 0.003},
         "lr": 0.001}
    ]

    for cfg in test_configs:
        confs, names = group_wise_lr(model, cfg)
        print("#" * 140)
        pprint(cfg)
        print("-" * 80)
        pprint(confs)
        print("#" * 140)

Usage:
Suppose you have the resnet18 model from torchvision. Now you want to change LR of layer4 and layer3.1 and layer3.0.conv2:

from torchvision import models
model = models.resnet18(pretrained=True)
confs, names = group_wise_lr(model, {"layer4": {"lr": 0.3},
                                     "layer3": {"0": {"conv2": {"lr": 0.001}},
                                                "1": {"lr": 0.003}}})
# Notice we write in hierarchical structure, we go down the hierarchy till we need, not more.

This generates confs as 
```python
[{'lr': 0.3,
  'params': <generator object group_wise_lr.<locals>.<genexpr> at 0x11f4a84a0>},
 {'lr': 0.001,
  'params': <generator object group_wise_lr.<locals>.<genexpr> at 0x11f4a8510>},
 {'params': <generator object group_wise_lr.<locals>.<genexpr> at 0x11f4a8580>},
 {'lr': 0.003,
  'params': <generator object group_wise_lr.<locals>.<genexpr> at 0x11f4a85f0>},
 {'params': <generator object group_wise_lr.<locals>.<genexpr> at 0x11f4a87b0>}]
```

The last {"params": <generator>} holds all the model params for which we did not specify any lr.
The third {"params": <generator>} holds all params of layer3.0 which aren’t in submodule conv2.
This allows you to pass confs to an optimiser and do fine-grained lr tuning in a hierarchical manner.

I think these two APIs, model.base.parameters() and model.classifier.parameters(), are made for the standard model classes. How about self-defined models?

May I ask how to output the parameters of each layer for self-defined models, such as each layer’s learning rate? This would be important in fine-tuning, when the user wants to check whether the learning rate is set correctly.

You can use the same syntax for custom modules to get the parameters, e.g.:

my_custom_model.my_custom_layer.parameters()

That being said, modules do not contain the learning rate, which is set in the creation of the optimizer.
If you want to use layer-specific learning rates, you could use the per-parameter options.
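To check which learning rate each parameter ended up with, one option is to walk over `optimizer.param_groups` and map the tensors back to their names. This is a small sketch; the class `MyCustomModel`, its layer names, and the lr values are made up for illustration.

```python
import torch
import torch.nn as nn


class MyCustomModel(nn.Module):
    """Hypothetical custom model with two named sub-layers."""

    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 16)
        self.my_custom_layer = nn.Linear(16, 4)

    def forward(self, x):
        return self.my_custom_layer(torch.relu(self.features(x)))


model = MyCustomModel()

optimizer = torch.optim.SGD(
    [
        {'params': model.features.parameters()},
        {'params': model.my_custom_layer.parameters(), 'lr': 1e-3},
    ],
    lr=1e-2,
)

# Map each parameter tensor back to its name via object identity.
name_of = {id(p): n for n, p in model.named_parameters()}

lr_by_name = {}
for group in optimizer.param_groups:
    for p in group['params']:
        lr_by_name[name_of[id(p)]] = group['lr']

for name, lr in lr_by_name.items():
    print(name, lr)
```

Reading `param_groups` again during training also shows the values a scheduler has written, since schedulers update the `lr` entry of each group in place.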

Hey, thanks a lot for this, I found it very helpful. Is it possible, though, to apply an LR scheduler to one parameter group but not the other (meaning the other parameter group keeps a constant LR)?

I couldn’t figure it out, since the scheduler gets applied to all the parameter groups defined in your optimizer. Do you have any idea?
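One approach that might cover this case is LambdaLR with a list of functions, one per parameter group, where the function for the constant group always returns 1.0. The model, decay factor, and lr values below are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

optimizer = torch.optim.SGD(
    [
        {'params': model[0].parameters()},              # decayed group
        {'params': model[2].parameters(), 'lr': 1e-4},  # constant group
    ],
    lr=1e-2,
)

# One multiplicative factor per group, in the same order as the groups.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=[lambda epoch: 0.9 ** epoch, lambda epoch: 1.0],
)

for epoch in range(3):
    loss = model(torch.randn(2, 8)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

final_lrs = [group['lr'] for group in optimizer.param_groups]
print(final_lrs)
```

The list passed as `lr_lambda` has to contain exactly one function per parameter group.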

Hey there, I face the same problem here. I only need to change the learning rate of one small block, while most of the model keeps the same learning rate. How can I filter the parameters? I tried defining a list and searching in .named_parameters() as @ptrblck suggested in the 4th post, but I don’t know how to pass the new params list to torch.optim.SGD.

Update:
I fixed this problem by using list(filter()) to filter the named_parameters and defining SGD with the specified block.parameters(). So, the key is to find out how the blocks are defined in your model.