I defined two models that share essentially the same structure: one is pre-trained, which I call the **teacher** model, and the other is initialized from the **teacher** with some new layers added, which I call the **student** model (though strictly speaking it is not a real teacher-student setup).

In the initialization step, we load the **pretrained model's weights** as well as the **optimizer state_dict**.

As mentioned above, the teacher and the student differ in structure, so when loading the pretrained optimizer state_dict, an error occurs:

`ValueError: loaded state dict has a different number of parameter groups`

Unfortunately, there does not seem to be any way to load only specific named layers with explicit indexing like `net.layer1.weight`, or to pick out the corresponding param_groups.
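To illustrate why there is no named indexing to rely on, here is a minimal standalone sketch (not part of my actual code, just a throwaway two-layer model): the optimizer's state_dict references parameters by integer position only, so there is no `net.layer1.weight`-style key to select.

```
import torch
import torch.nn as nn

# throwaway model, only used to inspect what the optimizer state_dict looks like
net = nn.Sequential(nn.Linear(1, 1), nn.Linear(1, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.99)

print(opt.state_dict()['param_groups'])
# e.g. [{'lr': 0.01, 'momentum': 0.99, ..., 'params': [0, 1, 2, 3]}]
# 'params' holds integer indices rather than parameter names, so the mapping
# back to named layers is only implied by the order parameters were passed in
```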

I found a solution from here: first freeze the new layer by setting `requires_grad` to `False`, then do the loading. I adapted this solution to fit my case.

Here is my sample code:

**Define the network**

```
import torch.nn as nn
import torch
import random

class Mynet1(nn.Module):  # teacher
    def __init__(self):
        super(Mynet1, self).__init__()
        self.layer1 = nn.Linear(1, 1)
        self.layer2 = nn.Linear(1, 1)

    def forward(self, x):
        out = self.layer1(x)
        out2 = self.layer2(out)
        return out2

class Mynet2(nn.Module):  # student
    def __init__(self):
        super(Mynet2, self).__init__()
        self.layer1 = nn.Linear(1, 1)
        self.layerK = nn.Linear(1, 1)  # new layer, not in the teacher
        self.layer2 = nn.Linear(1, 1)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layerK(out)  # the newly defined layer
        out2 = self.layer2(out)
        return out2
```
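Loading the model weights themselves is not the issue; for completeness, here is a minimal sketch (not part of the training script below) of how the teacher's weights can be copied into the student with `strict=False`, so the new `layerK` simply keeps its fresh initialization:

```
# copy the teacher's weights into the student; layerK has no counterpart
# in Mynet1, so strict=False lets it keep its own initialization
net1 = Mynet1()
net2 = Mynet2()
missing, unexpected = net2.load_state_dict(net1.state_dict(), strict=False)
print(missing)     # ['layerK.weight', 'layerK.bias']
print(unexpected)  # []
```

The real question is the optimizer state, which the script below tries to handle.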

**Train main.py**

```
# the teacher network; pretend it is well trained, defined and loaded
net1 = Mynet1()
opt1 = torch.optim.SGD(net1.parameters(), lr=0.01, weight_decay=3e-5, momentum=0.99, nesterov=True)

# the student network
net2 = Mynet2()
loss = nn.L1Loss()

################################################
## Here is how I define opt2 and load opt1   ##
# freeze layerK first!
for name, param in net2.named_parameters():
    if 'K' in name:
        param.requires_grad = False
    else:
        param.requires_grad = True

# opt2 only sees the unfrozen parameters (layer1 and layer2), so its single
# param_group matches opt1's and load_state_dict no longer complains
opt2 = torch.optim.SGD(filter(lambda p: p.requires_grad, net2.parameters()), lr=0.01)
state = opt1.state_dict()
opt2.load_state_dict(state)

# add layerK as a second param_group with its own hyperparameters
opt2.add_param_group({'params': net2.layerK.parameters(), 'lr': 0.02, 'weight_decay': 1, 'momentum': 0.99, 'nesterov': True})

# release layerK so all the layers can be trained
for name, param in net2.named_parameters():
    param.requires_grad = True
#################################################################

# training
net2.train()
for i in range(500):
    x = y = random.randint(0, 10)
    x = torch.tensor([x]).type(dtype=torch.float)
    y = torch.tensor([y]).type(dtype=torch.float)
    pred = net2(x)
    opt2.zero_grad()
    l = loss(pred, y)
    l.backward()
    opt2.step()
```
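To sanity-check my understanding, I also print the param_groups after the loading steps. If I am right, group 0 should hold layer1/layer2 with the hyperparameters loaded from `opt1` (lr=0.01, momentum=0.99, weight_decay=3e-5, nesterov=True), and group 1 should hold layerK with the settings I passed to `add_param_group`:

```
# sanity check: opt2 should now have two param_groups --
# group 0: the 4 tensors of layer1/layer2, hyperparameters loaded from opt1
# group 1: the 2 tensors of layerK, with lr=0.02 etc. from add_param_group
for idx, group in enumerate(opt2.param_groups):
    hparams = {k: v for k, v in group.items() if k != 'params'}
    print(idx, len(group['params']), hparams)
```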

Could you please tell me whether `net2` and `opt2` will work as intended, i.e. updating `layer1` and `layer2` with one set of settings while `layerK` uses the other settings I defined?

Is this reasonable?