I defined two models with generally the same structure: one is pre-trained, which I call the teacher model; the other is initialized from the teacher but has some newly defined layers, which I call the student model (not strictly a teacher-student setup, though).
In the initialization part, we load the pretrained model's weights as well as the optimizer state_dict. As mentioned above, the teacher and the student differ slightly in structure, so when loading the pretrained optimizer state_dict an error like this occurs:
ValueError: loaded state dict has a different number of parameter groups
But unfortunately, there doesn't seem to be any way to load the state for specific named layers with explicit indexing such as net.layer1.weight, or anything similar to look up the corresponding param_groups.
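For reference, this is roughly what an optimizer state_dict contains; the parameters are tracked only by integer ids inside each group, which is why there seems to be no name-based lookup (just an inspection sketch, using the toy opt1 defined in the sample code below):

print(opt1.state_dict().keys())           # dict_keys(['state', 'param_groups'])
print(opt1.state_dict()['param_groups'])  # e.g. [{'lr': 0.01, ..., 'params': [0, 1, 2, 3]}]
# 'state' maps those integer ids to per-parameter buffers such as
# momentum_buffer, and only gets filled after the first opt1.step()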
I found a solution from here: he first freezes the new layer by setting requires_grad to False, then loads the state_dict, and so on. I modified this solution to fit my case.
Here is my sample code:
Define the networks
import torch.nn as nn
import torch
import random

class Mynet1(nn.Module):  # teacher
    def __init__(self):
        super(Mynet1, self).__init__()
        self.layer1 = nn.Linear(1, 1)
        self.layer2 = nn.Linear(1, 1)

    def forward(self, x):
        out = self.layer1(x)
        out2 = self.layer2(out)
        return out2

class Mynet2(nn.Module):  # student
    def __init__(self):
        super(Mynet2, self).__init__()
        self.layer1 = nn.Linear(1, 1)
        self.layerK = nn.Linear(1, 1)
        self.layer2 = nn.Linear(1, 1)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layerK(out)  # newly defined layer
        out2 = self.layer2(out)
        return out2
Training (main.py)
# the teacher network; pretend it is already well trained and loaded
net1 = Mynet1()
opt1 = torch.optim.SGD(net1.parameters(), lr=0.01, weight_decay=3e-5, momentum=0.99, nesterov=True)
# the student network
net2 = Mynet2()
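# (Sketch, not the part that fails:) copy the teacher's weights into the student.
# strict=False lets net2 accept net1's layer1/layer2 weights while keeping
# layerK's fresh initialization, since net1 has no layerK.
net2.load_state_dict(net1.state_dict(), strict=False)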
loss = nn.L1Loss()
################################################
## Here is how I defined opt2 and loaded opt1's state ##
# freeze layerK first!
for name, param in net2.named_parameters():
    if 'K' in name:
        param.requires_grad = False
    else:
        param.requires_grad = True
opt2 = torch.optim.SGD(filter(lambda p: p.requires_grad, net2.parameters()), lr=0.01)
# opt2 now has a single param group with the same number of parameters as opt1,
# so loading opt1's state_dict no longer raises the error above
state = opt1.state_dict()
opt2.load_state_dict(state)
# afterwards, add layerK as its own param group with its own hyperparameters
opt2.add_param_group({'params': net2.layerK.parameters(), 'lr': 0.02, 'weight_decay': 1, 'momentum': 0.99, 'nesterov': True})
# release layerK so all the layers can be trained
for name, param in net2.named_parameters():
    param.requires_grad = True
#################################################################
# training
net2.train()
for i in range(500):
    x = y = random.randint(0, 10)
    x = torch.tensor([x], dtype=torch.float)
    y = torch.tensor([y], dtype=torch.float)
    pred = net2(x)
    opt2.zero_grad()
    l = loss(pred, y)
    l.backward()
    opt2.step()
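For reference, this is how I inspect the resulting param groups of opt2 (I expect group 0 to hold layer1/layer2 with the hyperparameters inherited from opt1's state_dict, and group 1 to hold layerK with the settings passed to add_param_group):

for i, g in enumerate(opt2.param_groups):
    print(i, g['lr'], g.get('momentum'), g.get('weight_decay'), len(g['params']))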
Could you please tell me whether net2 and opt2 will work as I want, i.e. updating layer1 and layer2 with one set of hyperparameters while layerK uses the other settings I defined? Is this reasonable?