I defined two models with generally the same structure: one is pre-trained, which I call the teacher model; the other is initialized from the teacher but has some newly defined layers, which I call the student model (not strictly a teacher-student setup, though).
In the initialization part, we load the pretrained model's weights as well as the optimizer state_dict. As mentioned above, the teacher and the student differ slightly in structure, so when loading the pretrained optimizer state_dict an error like this occurs:
ValueError: loaded state dict has a different number of parameter groups
But unfortunately, there doesn't seem to be any way to load the state for specific named layers with explicit indexing such as net.layer1.weight, or anything similar to look up the corresponding param_groups.
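For reference, this is roughly what an optimizer state_dict contains; the parameters are tracked only by integer ids inside each group, which is why there seems to be no name-based lookup (just an inspection sketch, using the toy opt1 defined in the sample code below):

print(opt1.state_dict().keys())           # dict_keys(['state', 'param_groups'])
print(opt1.state_dict()['param_groups'])  # e.g. [{'lr': 0.01, ..., 'params': [0, 1, 2, 3]}]
# 'state' maps those integer ids to per-parameter buffers such as
# momentum_buffer, and only gets filled after the first opt1.step()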
I found a solution from here: he first freezes the new layer by setting requires_grad to False, then loads the state_dict, and so on. I modified this solution to fit my case.
Here is my sample code:
Define the networks
import torch.nn as nn
import torch
import random

class Mynet1(nn.Module):  # teacher
    def __init__(self):
        super(Mynet1, self).__init__()
        self.layer1 = nn.Linear(1, 1)
        self.layer2 = nn.Linear(1, 1)

    def forward(self, x):
        out = self.layer1(x)
        out2 = self.layer2(out)
        return out2

class Mynet2(nn.Module):  # student
    def __init__(self):
        super(Mynet2, self).__init__()
        self.layer1 = nn.Linear(1, 1)
        self.layerK = nn.Linear(1, 1)
        self.layer2 = nn.Linear(1, 1)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layerK(out)  # newly defined layer
        out2 = self.layer2(out)
        return out2
Training (main.py)
# the teacher network; pretend it is already well trained and loaded
net1 = Mynet1()
opt1 = torch.optim.SGD(net1.parameters(), lr=0.01, weight_decay=3e-5, momentum=0.99, nesterov=True)
# the student network
net2 = Mynet2()
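# (Sketch, not the part that fails:) copy the teacher's weights into the student.
# strict=False lets net2 accept net1's layer1/layer2 weights while keeping
# layerK's fresh initialization, since net1 has no layerK.
net2.load_state_dict(net1.state_dict(), strict=False)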
loss = nn.L1Loss()
################################################
## Here is how I defined opt2 and loaded opt1's state ##
# freeze layerK first!
for name, param in net2.named_parameters():
    if 'K' in name:
        param.requires_grad = False
    else:
        param.requires_grad = True
opt2 = torch.optim.SGD(filter(lambda p: p.requires_grad, net2.parameters()), lr=0.01)
# opt2 now has a single param group with the same number of parameters as opt1,
# so loading opt1's state_dict no longer raises the error above
state = opt1.state_dict()
opt2.load_state_dict(state)
# afterwards, add layerK as its own param group with its own hyperparameters
opt2.add_param_group({'params': net2.layerK.parameters(), 'lr': 0.02, 'weight_decay': 1, 'momentum': 0.99, 'nesterov': True})
# release layerK so all the layers can be trained
for name, param in net2.named_parameters():
    param.requires_grad = True
#################################################################
# training
net2.train()
for i in range(500):
    x = y = random.randint(0, 10)
    x = torch.tensor([x], dtype=torch.float)
    y = torch.tensor([y], dtype=torch.float)
    pred = net2(x)
    opt2.zero_grad()
    l = loss(pred, y)
    l.backward()
    opt2.step()
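For reference, this is how I inspect the resulting param groups of opt2 (I expect group 0 to hold layer1/layer2 with the hyperparameters inherited from opt1's state_dict, and group 1 to hold layerK with the settings passed to add_param_group):

for i, g in enumerate(opt2.param_groups):
    print(i, g['lr'], g.get('momentum'), g.get('weight_decay'), len(g['params']))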
Could you please tell me whether net2 and opt2 will work as I want, i.e. updating layer1 and layer2 with one set of hyperparameters while layerK uses the other settings I defined? Is this reasonable?