Is it okay to split (one model & one optimizer) into (two models & two optimizers)?

By adding a few layers to a pretrained model (PretrainedModel), I want to train a new model (MyNewModel) in which the existing pretrained layers (starting from the pretrained weights) and the newly added layers are trained simultaneously.

import torch
from torch import nn
from torch.optim import Adam

class PretrainedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_s = PretrainedModule(...)

    def forward(self, ...):
        outputs = self.layer_s(...)
        return outputs

class MyNewModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_s = PretrainedModule(...)
        self.new_layer_s = NewModule(...)

    def forward(self, ...):
        outputs = self.layer_s(...)
        outputs = self.new_layer_s(outputs)
        return outputs

net = MyNewModel(...)
net.load_state_dict(...)  # load the parameters that belong to PretrainedModule
optim = Adam(net.parameters(), lr=1e-4)

for idx, batch in enumerate(training_set):
    output = net(...)
    loss = loss_fn(output)
    loss.backward()
    optim.step()

Is it okay to train MyNewModel by splitting it into two models and using two optimizers?
For example,

class PretrainedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_s = PretrainedModule(...)

    def forward(self, ...):
        outputs = self.layer_s(...)
        return outputs

class AddedLayers(nn.Module):
    def __init__(self):
        super().__init__()
        self.new_layer_s = NewModule(...)

    def forward(self, ...):
        outputs = self.new_layer_s(...)
        return outputs

pretrained_net = PretrainedModel(...)
pretrained_net.load_state_dict(torch.load('pretrained.pt'))
optimizer_pretrained = Adam(pretrained_net.parameters(), lr=1e-4)

added_net = AddedLayers(...)
optimizer_added = Adam(added_net.parameters(), lr=1e-4)

for idx, batch in enumerate(training_set):
    output = pretrained_net(...)
    output = added_net(output)
    loss = loss_fn(output)
    loss.backward()
    optimizer_pretrained.step()
    optimizer_added.step()

Or do I have to use the first approach above?

Yes, your second approach is totally fine: you can “split” your model into submodules and use different optimizers for them. Just make sure to zero out the gradients in your training loop, as this part is missing :wink:
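
For example, a minimal sketch of your second loop with the gradient reset added (using the same pretrained_net, added_net, and loss_fn from your snippet):

for idx, batch in enumerate(training_set):
    # clear the gradients accumulated in the previous iteration
    optimizer_pretrained.zero_grad()
    optimizer_added.zero_grad()

    output = pretrained_net(...)
    output = added_net(output)
    loss = loss_fn(output)
    loss.backward()

    # each optimizer updates the parameters of its own submodule
    optimizer_pretrained.step()
    optimizer_added.step()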


Thanks for the kind explanation! To make the code cleaner, I tried to use one optimizer with a different learning rate for each module. My code snippet is below. Is it okay?

class PretrainedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_s = PretrainedModule(...)

    def forward(self, ...):
        outputs = self.layer_s(...)
        return outputs

class AddedLayers(nn.Module):
    def __init__(self):
        super().__init__()
        self.new_layer_s = NewModule(...)

    def forward(self, ...):
        outputs = self.new_layer_s(...)
        return outputs

pretrained_net = PretrainedModel(...)
pretrained_net.load_state_dict(torch.load('pretrained.pt'))

added_net = AddedLayers(...)

pretrained_params = list(map(lambda x: x[1], pretrained_net.named_parameters()))
added_params = list(map(lambda x: x[1], added_net.named_parameters()))

# one optimizer with two parameter groups: the added layers use lr=1e-5,
# the pretrained layers fall back to the default lr=1e-6
optimizer = Adam([{'params': pretrained_params},
                  {'params': added_params, 'lr': 1e-5}],
                 lr=1e-6)

for idx, batch in enumerate(training_set):
    output = pretrained_net(...)
    output = added_net(output)
    loss = loss_fn(output)
    loss.backward()
    optimizer_pretrained.step()
    optimizer_added.step()
    optimizer.zero_grad()

Yes, the code looks generally alright.

I don’t quite understand the list(map(lambda ...)) usage to create the parameter lists and think just using model.parameters() or list(model.parameters()) should do the same.
In any case, you would still need to fix the optimizer.step() calls, since you are now creating one optimizer but are trying to call the step() method on two different optimizers.
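
For example, a minimal sketch of the corrected setup, assuming the same modules, training_set, and loss_fn as in your snippets:

optimizer = Adam([
    {'params': pretrained_net.parameters()},         # uses the default lr=1e-6
    {'params': added_net.parameters(), 'lr': 1e-5},  # per-group learning rate
], lr=1e-6)

for idx, batch in enumerate(training_set):
    optimizer.zero_grad()   # reset the gradients of both parameter groups
    output = pretrained_net(...)
    output = added_net(output)
    loss = loss_fn(output)
    loss.backward()
    optimizer.step()        # the single optimizer updates both parameter groups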
