Combining Trained Models in PyTorch

You should set the requires_grad attribute of all parameters of modelA and modelB to False and leave it as True for the final classifier.
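
A minimal sketch of that freezing step, assuming `modelA` and `modelB` are small `nn.Linear` stand-ins (hypothetical, in place of the real pretrained models) and that the final classifier takes their concatenated outputs:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pretrained sub-models.
modelA = nn.Linear(10, 2)
modelB = nn.Linear(10, 2)
classifier = nn.Linear(4, 2)  # the only part that should be trained

# Freeze every parameter of both sub-models.
for model in (modelA, modelB):
    for param in model.parameters():
        param.requires_grad_(False)

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

x1, x2 = torch.randn(8, 10), torch.randn(8, 10)
out = classifier(torch.cat((modelA(x1), modelB(x2)), dim=1))
out.sum().backward()

# Frozen parameters never receive a gradient.
frozen_grads = [p.grad for m in (modelA, modelB) for p in m.parameters()]
```

After `backward()`, every entry of `frozen_grads` should be `None`, while `classifier.weight.grad` is populated.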


Is it possible to save such an ensemble model using JIT (provided that probably all the models are in JIT)?

Thank you.

Hello @ptrblck,

import torch
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    def __init__(self, encoded_image_size=14):
        super(Encoder, self).__init__()
        self.enc_image_size = encoded_image_size

        resnet = torchvision.models.resnet101(pretrained=True)
        # drop the average pooling and fc layers
        modules = list(resnet.children())[:-2]
        self.resnet = nn.Sequential(*modules)

        densenet = torchvision.models.densenet121(pretrained=True)
        # DenseNet has only two children (features, classifier), so drop just the classifier
        modules_1 = list(densenet.children())[:-1]
        self.densenet = nn.Sequential(*modules_1)

        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

    def forward(self, images):
        out = self.resnet(images)
        out1 = self.densenet(images)

        # after tweaking dimensions...
        # concatenate both feature maps along the channel dimension
        x = torch.cat((out, out1), dim=1)

        z = self.adaptive_pool(x)
        return z

Does this code snippet make sense?

The code looks generally OK, but I wouldn’t recommend creating new models by passing the child modules to nn.Sequential.
With this approach each submodule is called sequentially, so you lose all functional calls that were used in the original forward.
E.g. for DenseNet you would lose the functional relu, adaptive average pooling, and flatten calls in its forward.

If you just want to remove the last layer, replace it with nn.Identity.
On the other hand, if you want to manipulate more layers or the forward pass in general, I would recommend creating a custom model that derives from the corresponding torchvision model and changing its forward appropriately.
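
To make the difference concrete, here is a small sketch. `TinyNet` and `TinyNetFeatures` are hypothetical stand-ins for a torchvision model (they are not part of any library); the point is only that functional calls in `forward` survive an `nn.Identity` swap or a subclass, but not a rebuild via `nn.Sequential`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Hypothetical model whose forward uses functional calls."""
    def __init__(self):
        super().__init__()
        self.features = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.classifier = nn.Linear(8, 10)

    def forward(self, x):
        x = self.features(x)
        x = F.relu(x)                     # functional call, not a child module
        x = F.adaptive_avg_pool2d(x, 1)   # functional call
        x = torch.flatten(x, 1)           # functional call
        return self.classifier(x)

x = torch.randn(2, 3, 16, 16)

# Option 1: swap the last layer for nn.Identity; the original forward stays intact.
model = TinyNet()
model.classifier = nn.Identity()
feats = model(x)                          # pooled and flattened features

# Option 2: derive from the model and change forward explicitly.
class TinyNetFeatures(TinyNet):
    def forward(self, x):
        return F.relu(self.features(x))   # keep the spatial feature map

feature_map = TinyNetFeatures()(x)

# For comparison: rebuilding from children silently drops relu, pooling and flatten.
rebuilt = nn.Sequential(*list(TinyNet().children())[:-1])
raw = rebuilt(x)
```

Here `feats` is a flat `[2, 8]` tensor, whereas `raw` is the unactivated `[2, 8, 16, 16]` convolution output, because the functional calls were never registered as child modules.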

@ptrblck, can you please clarify how backward() behaves here, i.e. will the error flow back to both modelA and modelB, or will it be limited to the ensemble model only?

One more thing, please: if I save the ensemble model, will it save the weights of modelA and modelB as well, or only the linear layer of the ensemble model?

The gradients will be calculated for the parameters of all submodules as well as the final classifier, if you didn’t disable it via .requires_grad=False.
Also the state_dict() will return all parameters of all submodules and the final classifier.
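
Both statements can be checked with a short sketch. The `MyEnsemble` below mirrors the class discussed in this thread, with `nn.Linear` stand-ins (an assumption) for the two sub-models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyEnsemble(nn.Module):
    def __init__(self, modelA, modelB):
        super().__init__()
        self.modelA = modelA
        self.modelB = modelB
        self.classifier = nn.Linear(4, 2)

    def forward(self, x1, x2):
        x = torch.cat((self.modelA(x1), self.modelB(x2)), dim=1)
        return self.classifier(F.relu(x))

# Hypothetical stand-ins for the real sub-models.
ensemble = MyEnsemble(nn.Linear(10, 2), nn.Linear(10, 2))
ensemble(torch.randn(4, 10), torch.randn(4, 10)).sum().backward()

# Names of all parameters that received a gradient.
params_with_grad = [n for n, p in ensemble.named_parameters() if p.grad is not None]

# Keys that would be written by torch.save(ensemble.state_dict(), ...).
saved_keys = list(ensemble.state_dict().keys())
```

Both lists contain the `modelA.*`, `modelB.*`, and `classifier.*` entries, so the sub-models are trained and saved along with the classifier.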

So, if I want to use modelA only in inference mode and train modelB and the linear layer of the ensemble model, can I do it this way, or will gradients flow into modelA as well just because it is part of the ensemble model's forward pass?

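import torch
import torch.nn as nn
import torch.nn.functional as F

class MyEnsemble(nn.Module):
    def __init__(self, modelA, modelB):
        super(MyEnsemble, self).__init__()
        self.modelA = modelA
        self.modelB = modelB
        self.classifier = nn.Linear(4, 2)

    def forward(self, x1, x2):
        # modelA is only used for inference, so no graph is built for it
        with torch.no_grad():
            x1 = self.modelA(x1)
        x2 = self.modelB(x2)
        # concatenate both outputs along the feature dimension
        x = torch.cat((x1, x2), dim=1)
        x = self.classifier(F.relu(x))
        return x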

This code would not calculate any gradients for modelA. If that’s your use case, then you should be fine.
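
A quick way to verify this is to inspect the gradients after a backward pass. The sketch below uses hypothetical stand-in modules; it also calls `modelA.eval()`, since `torch.no_grad()` only disables gradient tracking and does not change how layers such as BatchNorm or dropout behave:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real sub-models.
modelA = nn.Sequential(nn.Linear(10, 2), nn.BatchNorm1d(2))
modelB = nn.Linear(10, 2)
classifier = nn.Linear(4, 2)

# no_grad() does not switch BatchNorm/dropout to inference behaviour.
modelA.eval()

x1, x2 = torch.randn(4, 10), torch.randn(4, 10)
with torch.no_grad():
    a = modelA(x1)  # computed without building a graph
b = modelB(x2)
out = classifier(torch.relu(torch.cat((a, b), dim=1)))
out.sum().backward()

a_requires_grad = a.requires_grad
grads_A = [p.grad for p in modelA.parameters()]
grads_B = [p.grad for p in modelB.parameters()]
```

`a_requires_grad` is `False` and every entry of `grads_A` is `None`, while `grads_B` holds gradient tensors.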

@ptrblck Hello, I had a doubt about the MyEnsemble class: when loss.backward() is called, will the weights of MyModelA and MyModelB (the layers inside them) be updated? Thanks in advance.

Yes, in my example both sub-models would be updated.
These models are treated as any other layer and you could imagine swapping them for e.g. nn.Linear.

Thanks a lot for replying!

Let's say I load a ModelA with pre-trained weights, and then add some additional layers operating on the outputs of ModelA layers. I only want to train these new layers (at least for starters).
I understand that I need to set the requires_grad to False for ModelA, and to True for the new layers.
But should I also set ModelA to eval mode during training? So that the BatchNorm layers in it behave as they would during inference? Or will they do that anyway with requires_grad=False?

Hi, I have a model A which was trained on 50 classes, and now I want to add 10 more classes. How can I train my model without affecting the previous results?
In the end I want to predict 60 classes, and I don’t want to retrain my model from the very beginning.


Usually you would replace the last linear layer to increase the number of classes.
Even if you copy the pretrained (partial) parameters to this layer, you would still retrain all class outputs, if you don’t manually zero out the gradients of the pretrained weights.
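
One possible way to zero out those gradients is a tensor hook, sketched below. The layer sizes and `zero_old_rows` are illustrative assumptions, and plain SGD is used on purpose: an optimizer with weight decay would still change the old rows even when their gradient is zero.

```python
import torch
import torch.nn as nn

num_old, num_new, feat_dim = 50, 10, 32    # feat_dim is illustrative

old_fc = nn.Linear(feat_dim, num_old)      # stands in for the trained 50-class layer
new_fc = nn.Linear(feat_dim, num_old + num_new)

# Copy the trained rows into the enlarged layer.
with torch.no_grad():
    new_fc.weight[:num_old].copy_(old_fc.weight)
    new_fc.bias[:num_old].copy_(old_fc.bias)

def zero_old_rows(grad):
    # Return a copy whose first num_old entries (the old classes) are zeroed.
    grad = grad.clone()
    grad[:num_old] = 0
    return grad

new_fc.weight.register_hook(zero_old_rows)
new_fc.bias.register_hook(zero_old_rows)

# Plain SGD without weight decay or momentum.
optimizer = torch.optim.SGD(new_fc.parameters(), lr=0.1)
features = torch.randn(16, feat_dim)
targets = torch.randint(0, num_old + num_new, (16,))
loss = nn.functional.cross_entropy(new_fc(features), targets)
loss.backward()
optimizer.step()
```

After the step, the first 50 rows of `new_fc.weight` still equal `old_fc.weight`, while the 10 new rows have been updated.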

If you don’t want to “touch” the previously trained outputs at all, you could probably add a completely new linear layer for the new classes and use the penultimate features as its input.
Afterwards you could concatenate the “old” and “new” linear layer outputs to calculate the loss.
This would be similar to the aforementioned approach, but would avoid having to zero out the gradients etc. manually and might be a bit easier to implement.
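
A rough sketch of this second approach follows. `ExtendedClassifier`, the toy backbone, and the decision to freeze both the backbone and the old head are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class ExtendedClassifier(nn.Module):
    """Hypothetical wrapper that adds a separate head for the new classes."""
    def __init__(self, backbone, old_head, num_new_classes):
        super().__init__()
        self.backbone = backbone
        self.old_head = old_head
        self.new_head = nn.Linear(old_head.in_features, num_new_classes)
        # Assumption: everything that was already trained stays frozen.
        for module in (self.backbone, self.old_head):
            for p in module.parameters():
                p.requires_grad_(False)

    def forward(self, x):
        feats = self.backbone(x)  # penultimate features
        # Concatenate "old" and "new" logits along the class dimension.
        return torch.cat((self.old_head(feats), self.new_head(feats)), dim=1)

backbone = nn.Sequential(nn.Linear(20, 32), nn.ReLU())  # toy feature extractor
old_head = nn.Linear(32, 50)                            # trained 50-class layer
model = ExtendedClassifier(backbone, old_head, num_new_classes=10)

logits = model(torch.randn(8, 20))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 60, (8,)))
loss.backward()
```

Only `model.new_head` receives gradients here, so the previously trained weights are left untouched. Note that the old logits still enter the softmax of the loss, so the predicted probabilities for the old classes can shift once the new head is trained.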

Let me know, if this would work for you.

Hi, were you able to find an answer to your question?

I did not, no. But I decided to set the frozen parts of the network (where requires_grad is False) to eval mode, and I overrode the train(..) method to keep it that way when train() or eval() is called on the network from the training script.
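
That approach could look roughly like the sketch below. `FrozenBackboneNet` and the toy layers are hypothetical names; the relevant detail is that `requires_grad=False` alone does not stop BatchNorm from updating its running statistics in training mode, which is why the frozen part is kept in eval mode:

```python
import torch
import torch.nn as nn

class FrozenBackboneNet(nn.Module):
    """Hypothetical wrapper: frozen backbone stays in eval mode, head is trained."""
    def __init__(self, backbone, head):
        super().__init__()
        self.backbone = backbone
        self.head = head
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.backbone.eval()

    def train(self, mode=True):
        # Switch the whole model as usual, then force the backbone back to eval.
        super().train(mode)
        self.backbone.eval()
        return self

    def forward(self, x):
        return self.head(self.backbone(x))

backbone = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8))
head = nn.Linear(8, 2)
net = FrozenBackboneNet(backbone, head)

net.train()  # called from the training script
before = backbone[1].running_mean.clone()
net(torch.randn(16, 4))
after = backbone[1].running_mean.clone()
```

With this override, `net.train()` leaves `net.backbone.training` as `False`, and the BatchNorm running statistics are identical before and after the forward pass.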

I am using a pretrained Faster R-CNN model and I am trying to add a new layer using this code:

for p in model.parameters():
    p.requires_grad = False
model.roi_heads.box_head.fc8 = torch.nn.Linear(in_features, 1024, bias=True)

I am adding an 8th layer and trying to freeze the parameters of all previous layers, but I am getting this error:

Element 0 of tensors does not require grad and does not have a grad_fn

Please explain this along with sample code. Thanks in advance.

model.roi_heads.box_head returns the TwoMLPHead module, which flattens the input and applies two nn.Linear layers with a relu non-linearity.
You won’t be able to just add a new attribute (fc8) to this module without changing the forward method as well.
The proper workaround would be to create a custom TwoMLPHead layer with 3 linear layers and replace box_head with it.
A hacky workaround would be to redefine fc7 as an nn.Sequential container and add the new linear layer to it:

model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.roi_heads.box_head.fc7 = nn.Sequential(
    nn.Linear(model.roi_heads.box_head.fc7.in_features, model.roi_heads.box_head.fc7.out_features)
)
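
For the "proper" variant, a custom head with three linear layers might look like the sketch below. `ThreeMLPHead` is a hypothetical class modeled on the flatten/linear/relu structure described above, and the sizes are illustrative. The reported error most likely arises because the loss does not depend on any trainable parameter: all existing layers are frozen and the new `fc8` is never called in `forward`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeMLPHead(nn.Module):
    """Hypothetical box head: flatten, then three linear layers with ReLU."""
    def __init__(self, in_channels, representation_size):
        super().__init__()
        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)
        # New layer; the output size is kept so a following predictor still fits.
        self.fc8 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        x = F.relu(self.fc8(x))  # the new layer is actually used in forward
        return x

# Illustrative sizes, much smaller than in the real detection model.
head = ThreeMLPHead(in_channels=16 * 7 * 7, representation_size=64)

# Freeze the "old" layers and train only fc8.
for name, p in head.named_parameters():
    p.requires_grad_(name.startswith("fc8"))

pooled = torch.randn(4, 16, 7, 7)  # stands in for RoI-pooled features
out = head(pooled)
out.sum().backward()

# With torchvision, the head could then be plugged in via:
# model.roi_heads.box_head = head
```

Because `fc8` takes part in the forward pass and still requires gradients, `out.requires_grad` is `True` and `backward()` runs without the error. To reuse pretrained weights, the `fc6` and `fc7` parameters would additionally have to be copied over from the original head.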

Could you extend this small example to show how to apply soft-parameter sharing across modelA and modelB?