Combining Trained Models in PyTorch

Is there any way I can use the ASTER model only for calculating the CTC loss between the text predictions and the labels, and then backpropagate the loss to update the generator parameters only?
Also, I am using the following as an intermediate model to make the generator's output compatible with the input expected by the ASTER model:

class BridgeModel(nn.Module):

    def __init__(self):
        super(BridgeModel, self).__init__()

    def forward(self, x):
        # Repeat the single channel to 3 channels (equivalent to torch.cat([x, x, x], 1)).
        out = x.repeat(1, 3, 1, 1)
        # Resize to the spatial resolution expected by the ASTER model.
        out = nn.functional.interpolate(out, (128, 128))
        # Normalize to [-1, 1] (assuming the generator output is in [0, 1]).
        out.sub_(0.5).div_(0.5)
        input_dict = {}
        input_dict['images'] = out.to(device)
        input_dict['rec_targets'] = torch.IntTensor(1, args.max_len).fill_(1).to(device)
        input_dict['rec_lengths'] = [args.max_len]
        return input_dict
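
For reference, the intended usage is roughly the following (just a sketch; how the ASTER model consumes the dict is an assumption on my side):

bridge = BridgeModel().to(device)

fake = generator(z)              # generator output, e.g. [N, 1, H, W]
input_dict = bridge(fake)        # repeated to 3 channels, resized, normalized
output_dict = aster(input_dict)  # hypothetical call into the (frozen) ASTER model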

Thanks again for your help…

The .requires_grad attribute of the parameters of the model that should be updated should not be set to False.

You can disable the gradient calculation for the models that should be frozen, as seen in this example:

import torch
import torch.nn as nn

modelA = nn.Linear(1, 1)
modelB = nn.Linear(1, 1)

# Freeze modelB so no gradients are computed for its parameters.
for param in modelB.parameters():
    param.requires_grad = False

out = modelA(torch.randn(1, 1))
out = modelB(out)
out.backward()

for param in modelA.parameters():
    print(param.grad)  # valid grads

for param in modelB.parameters():
    print(param.grad)  # None

torch.backends.cudnn.enabled = False or the context manager with torch.backends.cudnn.flags(enabled=False) would disable cudnn.

Thanks, but I was setting requires_grad to False for the ASTER model (which does not need to be updated).

Now I think it's working, but both models' parameters have requires_grad set to True. However, when defining the optimizer I passed only generator.parameters(), so the ASTER model weights will not be updated and only the generator weights will be updated. Am I right?
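
In other words, something like this (a minimal sketch with hypothetical stand-in modules):

import torch
import torch.nn as nn

# Hypothetical stand-ins: "generator" produces images, "recognizer" stands in for the ASTER model.
generator = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1))
recognizer = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 10))

# Optionally freeze the recognizer; this avoids storing gradients for its parameters.
for p in recognizer.parameters():
    p.requires_grad = False

# The optimizer only receives the generator's parameters, so optimizer.step()
# can only ever update the generator, regardless of requires_grad settings elsewhere.
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)

x = torch.randn(4, 1, 32, 32)
target = torch.randint(0, 10, (4,))

optimizer.zero_grad()
out = recognizer(generator(x))                    # loss still backpropagates *through* the recognizer
loss = nn.functional.cross_entropy(out, target)   # stand-in for the actual CTC loss
loss.backward()
optimizer.step()                                  # updates the generator only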

I’ve followed this discussion, especially the recommendation from @ptrblck, to create the following model:

class VGG(nn.Module):
    def __init__(self, features, output_dim):
        super().__init__()
        
        self.features = features
        
        self.avgpool = nn.AdaptiveAvgPool2d(7)
        
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.LeakyReLU(inplace = True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace = True),
            nn.Dropout(0.5),
            nn.Linear(4096, output_dim),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x) 
        h = x.view(x.shape[0], -1)
        x = self.classifier(h)
        return x

class TwoPagesModule(nn.Module):
    def __init__(self, prevPageVGG, targPageVGG):
        super(TwoPagesModule, self).__init__()
        self.prevPageVGG = prevPageVGG
        self.targPageVGG = targPageVGG
        self.classifier = nn.Sequential(
            nn.LeakyReLU(inplace = True),
            nn.Dropout(0.5),
            nn.Linear(2*256, 2)
        )

    def forward(self, x1, x2):
        x1 = self.prevPageVGG(x1)
        x2 = self.targPageVGG(x2)
        
        x = torch.cat((x1, x2), dim=1)
        x = self.classifier(x)
        return x

The final model is shown below. After loading the pre-trained VGG16, I set requires_grad to False for the CNN feature layers that come before index 28.

TwoPagesModule(
  (prevPageVGG): VGG(
    (features): Sequential(
      (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
      (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (3): ReLU(inplace=True)
      (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (6): ReLU(inplace=True)
      (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (8): ReLU(inplace=True)
      (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (11): ReLU(inplace=True)
      (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (13): ReLU(inplace=True)
      (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (15): ReLU(inplace=True)
      (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (18): ReLU(inplace=True)
      (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (20): ReLU(inplace=True)
      (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (22): ReLU(inplace=True)
      (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (25): ReLU(inplace=True)
      (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (27): ReLU(inplace=True)
      (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (29): ReLU(inplace=True)
      (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (avgpool): AdaptiveAvgPool2d(output_size=7)
    (classifier): Sequential(
      (0): Linear(in_features=25088, out_features=256, bias=True)
      (1): LeakyReLU(negative_slope=0.01, inplace=True)
      (2): Dropout(p=0.5, inplace=False)
    )
  )
  (targPageVGG): VGG(
    (features): Sequential(
      (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU(inplace=True)
      (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (3): ReLU(inplace=True)
      (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (6): ReLU(inplace=True)
      (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (8): ReLU(inplace=True)
      (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (11): ReLU(inplace=True)
      (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (13): ReLU(inplace=True)
      (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (15): ReLU(inplace=True)
      (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (18): ReLU(inplace=True)
      (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (20): ReLU(inplace=True)
      (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (22): ReLU(inplace=True)
      (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (25): ReLU(inplace=True)
      (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (27): ReLU(inplace=True)
      (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (29): ReLU(inplace=True)
      (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (avgpool): AdaptiveAvgPool2d(output_size=7)
    (classifier): Sequential(
      (0): Linear(in_features=25088, out_features=256, bias=True)
      (1): LeakyReLU(negative_slope=0.01, inplace=True)
      (2): Dropout(p=0.5, inplace=False)
    )
  )
  (classifier): Sequential(
    (0): LeakyReLU(negative_slope=0.01, inplace=True)
    (1): Dropout(p=0.5, inplace=False)
    (2): Linear(in_features=512, out_features=2, bias=True)
  )
)
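
The freezing described above could be sketched like this (my reconstruction, assuming the feature indices shown in the printout):

# Freeze all feature layers before index 28, leaving the last conv block
# and the new classifier heads trainable (indices refer to the printed Sequential above).
for vgg in (model.prevPageVGG, model.targPageVGG):
    for idx, layer in enumerate(vgg.features):
        if idx < 28:
            for p in layer.parameters():
                p.requires_grad = False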

I want to set different learning rates for the VGG feature parameters and for the ensemble ones. I tried something like the following, but it does not work:

params = [
    {'params': model.prevPageVGG.features.parameters(), 'lr': FOUND_LR / 10},
    {'params': model.targPageVGG.features.parameters(), 'lr': FOUND_LR / 10},
    {'params': model.prevPageVGG.classifier.parameters()},
    {'params': model.targPageVGG.classifier.parameters()},
    {'params': model.classifier.parameters()},
]

optimizer = optim.Adam(params, lr=FOUND_LR)

the error:

ValueError: some parameters appear in more than one parameter group

Can anyone help me find the best way to set the learning rate for the problem exposed?

Thanks in advance!
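
Edit: from what I can tell, this ValueError is raised when the same parameter tensor ends up in more than one group, which can happen e.g. if prevPageVGG and targPageVGG refer to the same VGG instance. A minimal sketch of building non-overlapping groups, assuming that is the cause here:

import torch.optim as optim

# Collect each parameter only once (keyed by id), so no tensor appears in two groups.
feature_params, head_params, seen = [], [], set()

for module in (model.prevPageVGG.features, model.targPageVGG.features):
    for p in module.parameters():
        if id(p) not in seen:
            seen.add(id(p))
            feature_params.append(p)

for module in (model.prevPageVGG.classifier, model.targPageVGG.classifier, model.classifier):
    for p in module.parameters():
        if id(p) not in seen:
            seen.add(id(p))
            head_params.append(p)

optimizer = optim.Adam([
    {'params': feature_params, 'lr': FOUND_LR / 10},
    {'params': head_params},
], lr=FOUND_LR)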

@ptrblck I am trying to achieve the same thing, and your code makes sense. I don't know why I am failing to implement it; could you please point out where I am going wrong?

1. I pre-trained two models, multiNetA and multiNetB, separately; both classify data into classes 0-8.
2. Both have the same number of features in their training data.

After training, when I blend them using your code:

class MyEnsemble(nn.Module):
    def __init__(self, multiNetA, multiNetB):
        super(MyEnsemble, self).__init__()
        self.modelA = multiNetA
        self.modelB = multiNetB
        self.classifier = nn.Linear(9, 9)
        
    def forward(self, x1, x2):
        x1 = self.modelA(x1)
        print(f'x1 is {x1.shape}')
        x2 = self.modelB(x2)
        print(f'x2 is {x2.shape}')
        x  = torch.cat((x1, x2), dim=1)
        x  = self.classifier(F.relu(x))
        return x

#Create models and load state_dicts  
NUM_FEATURES = train_tensorA.shape[1]
NUM_CLASSES  = 9   
modelA = multiNetA(NUM_FEATURES,NUM_CLASSES)

NUM_FEATURES = train_tensorB.shape[1]
NUM_CLASSES  = 9   
modelB = multiNetB(NUM_FEATURES,NUM_CLASSES)

modelA.load_state_dict(torch.load('ModelA.pth'))
modelB.load_state_dict(torch.load('ModelB.pth'))

model = MyEnsemble(modelA, modelB)

print(model)
output = model(train_tensorA,train_tensorB)
print(output)

It gives me the following error:

MyEnsemble(
  (modelA): multiNetA(
    (lin0): Linear(in_features=75, out_features=46, bias=True)
    (lin1): Linear(in_features=46, out_features=48, bias=True)
    (lin2): Linear(in_features=48, out_features=9, bias=True)
    (noise): GaussianNoise()
    (bn0): BatchNorm1d(75, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (bn1): BatchNorm1d(46, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (bn2): BatchNorm1d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout_l0): Dropout(p=0.5, inplace=False)
    (dropout_l1): Dropout(p=0.5, inplace=False)
  )
  (modelB): multiNetB(
    (lin0): Linear(in_features=75, out_features=46, bias=True)
    (lin1): Linear(in_features=46, out_features=48, bias=True)
    (lin2): Linear(in_features=48, out_features=9, bias=True)
    (noise): GaussianNoise()
    (bn0): BatchNorm1d(75, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (bn1): BatchNorm1d(46, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (bn2): BatchNorm1d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout_l0): Dropout(p=0.2722241899012789, inplace=False)
    (dropout_l1): Dropout(p=0.2258820493932866, inplace=False)
  )
  (classifier): Linear(in_features=9, out_features=9, bias=True)
)
x1 is torch.Size([139843, 9])
x2 is torch.Size([160000, 9])

RuntimeError                              Traceback (most recent call last)
<ipython-input-244-38737b026d5b> in <module>
     30 
     31 print(model)
---> 32 output = model(train_tensorA,train_tensorB)
     33 print(output)

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

<ipython-input-244-38737b026d5b> in forward(self, x1, x2)
     11         x2 = self.modelB(x2)
     12         print(f'x2 is {x2.shape}')
---> 13         x  = torch.cat((x1, x2), dim=1)
     14         x  = self.classifier(F.relu(x))
     15         return x

RuntimeError: Sizes of tensors must match except in dimension 1. Got 139843 and 160000 in dimension 0 (The offending index is 1)

These are the shapes of train_tensorA and train_tensorB:

print(train_tensorA.shape,train_tensorB.shape)

torch.Size([134412, 75]) torch.Size([6214, 75])

What am I doing wrong?

It seems you are using two tensors with different batch sizes.
While the train_tensorX shapes are given as:

torch.Size([134412, 75]) torch.Size([6214, 75])

inside the forward method they are printed as:

x1 is torch.Size([139843, 9])
x2 is torch.Size([160000, 9])

and I don't know where this difference is coming from (you should check whether modelX is changing the batch size, which would most likely be wrong).

In any case, you won't be able to concatenate two tensors in dim1 if the other dimensions have different shapes, so make sure to use the same number of samples for both tensors.


@ptrblck can't I train two models on different data (but generated from the same population) and blend them?

I think it could be possible, but of course you would have to run some experiments.
I don’t see any “technical” limitations coming from the framework.


@ptrblck this is what I tried in my first attempt, which failed:

x1 is torch.Size([139843, 9])
x2 is torch.Size([160000, 9])

When I train both models with the same number of rows and dimensions (of course with different NN architectures), it works. What else should I look at? Please point me in the right direction.

The issue in this case is that you are trying to use tensors with a different number of samples while concatenating them in dim1, which will not work. Here is a small example:

a = torch.randn(2, 1)
b = torch.randn(3, 2)
print(a)
> tensor([[-0.5933],
          [-0.5581]])
print(b)
> tensor([[ 0.8039, -0.2030],
          [ 0.4271,  0.0893],
          [-1.0381,  0.5928]])
torch.cat((a, b), dim=1)
> RuntimeError: Sizes of tensors must match except in dimension 1. Got 2 and 3 in dimension 0

As you can see, you cannot concatenate (think of it as "appending" one tensor to the other) these two tensors, so you have to make sure that all dimensions besides the one used in torch.cat have the same shape.
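
For contrast, a quick sketch with matching batch sizes, which works:

a = torch.randn(2, 1)
b = torch.randn(2, 2)
c = torch.cat((a, b), dim=1)
print(c.shape)
> torch.Size([2, 3])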


@ptrblck thanks, that makes sense. Actually, what I am trying to achieve is this: I have 8 classes, two of which are very rare. Oversampling with any technique does not work, and a denoising autoencoder does not learn these rare classes in its latent space. I thought training one model without the rare classes and another model with only the rare classes, and then blending them, might work. Is there any other approach, please?

Hello @ptrblck,

I wanted to ask: if we have different models with different feature ranges at the output of self.fc1, do we need to make any modifications to bring them into the same range? For instance, if ModelA's self.fc1 output is in [-a, a] whereas ModelB's self.fc1 output is in [-b, b] (a and b are real numbers), should we bring them into the same range before concatenating the features and applying ReLU?

I haven't experimented with different approaches, so take this with a grain of salt, but I would assume that normalizing the feature tensors to the same or a similar range would be beneficial. In a previous post I think I showed a toy example where the scale difference was ~100x and the lower-scale signal was thus treated as noise by the model.
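
One possible way to do this (just a sketch, not something tested in this thread) is to L2-normalize each feature tensor along the feature dimension before concatenating:

import torch
import torch.nn.functional as F

# Hypothetical feature tensors with very different scales (batch of 4, 256 features each).
feat_a = torch.randn(4, 256) * 100.0   # roughly [-a, a] with large a
feat_b = torch.randn(4, 256)           # roughly [-b, b] with small b

# L2-normalize along the feature dimension so both tensors end up on a comparable scale.
feat_a = F.normalize(feat_a, dim=1)
feat_b = F.normalize(feat_b, dim=1)

x = torch.cat((feat_a, feat_b), dim=1)  # shape [4, 512]; neither signal dominates by scale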


Hi, thanks for this example. What if we have several sets of pretrained weights for Model A (resulting from training on different datasets) and want to merge them so that Model A can be used again on a new dataset? How can several sets of pretrained weights be merged and fed to Model A, so that we can keep training the same model on a new dataset?

I don’t know how weights from differently trained models can be “merged” and don’t think that e.g. taking the mean of these two parameter sets would directly work.


So, do you have any ideas how we can take advantage of having several sets of pretrained weights when training a model on a new, large dataset?

No, unfortunately I don't know how multiple models can be "combined" into a single one (unless you are using a model ensemble, i.e. using all models separately and passing their outputs/features to another stage), so we might need to wait for others in case they know how this can be achieved.


So I will use an ensemble to load the different pretrained weights and combine them before feeding the model.
Thank you.

In my case, since I have only one model architecture but three sets of pretrained weights, I was thinking of using it this way:

Model1 = MyModel()
Model2 = MyModel()
Model3 = MyModel()
Model1.load_state_dict(torch.load(pretrained_dataset1))
Model2.load_state_dict(torch.load(pretrained_dataset2))
Model3.load_state_dict(torch.load(pretrained_dataset3))
model = MyEnsemble(Model1, Model2, Model3)
output = model(new_x)

I just load the three sets of pretrained weights, combine them, and then use the new dataset as new_x to train the same model.
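
A minimal sketch of what such a three-input ensemble wrapper could look like (the head size is a placeholder, assuming each MyModel returns logits of the same shape for the same input):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyEnsemble(nn.Module):
    # Hypothetical three-model ensemble: each sub-model sees the same new input,
    # and their outputs are concatenated and passed to a small classifier head.
    def __init__(self, model1, model2, model3, num_classes=9):
        super().__init__()
        self.model1 = model1
        self.model2 = model2
        self.model3 = model3
        # 3 * num_classes input features, since three sets of logits are concatenated.
        self.classifier = nn.Linear(3 * num_classes, num_classes)

    def forward(self, x):
        out1 = self.model1(x)
        out2 = self.model2(x)
        out3 = self.model3(x)
        x = torch.cat((out1, out2, out3), dim=1)
        return self.classifier(F.relu(x))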