I have written a model with the following architecture:

```
CNNLSTM(
  (cnn): CNNText(
    (embed): Embedding(19410, 300, padding_idx=0)
    (convs1): ModuleList(
      (0): Conv2d(1, 32, kernel_size=(3, 300), stride=(1, 1))
      (1): Conv2d(1, 32, kernel_size=(5, 300), stride=(1, 1))
      (2): Conv2d(1, 32, kernel_size=(7, 300), stride=(1, 1))
    )
    (dropout): Dropout(p=0.6)
    (fc1): Linear(in_features=96, out_features=1, bias=True)
  )
  (lstm): RNN(
    (embedding): Embedding(19410, 300, padding_idx=0)
    (rnn): LSTM(300, 150, batch_first=True, bidirectional=True)
    (attention): Attention(
      (dense): Linear(in_features=300, out_features=1, bias=True)
      (tanh): Tanh()
      (softmax): Softmax()
    )
    (fc1): Linear(in_features=300, out_features=50, bias=True)
    (dropout): Dropout(p=0.5)
    (fc2): Linear(in_features=50, out_features=1, bias=True)
  )
  (fc1): Linear(in_features=146, out_features=1, bias=True)
)
```

I trained the RNN and the CNN separately on the same dataset and saved their weights. In the mixed model, I load those weights using the following function:

```
def load_pretrained_weights(self, model='cnn', path=None):
    # Load separately trained weights into the matching submodule.
    if model not in ['cnn', 'rnn']:
        raise AttributeError("Model must be either rnn or cnn")
    if model == 'cnn':
        self.cnn.load_state_dict(torch.load(path))
    elif model == 'rnn':
        self.lstm.load_state_dict(torch.load(path))
```

I then freeze the submodules using this function:

```
def freeze(self):
    # Stop gradient updates for both pretrained submodules.
    for p in self.cnn.parameters():
        p.requires_grad = False
    for p in self.lstm.parameters():
        p.requires_grad = False
```
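
For illustration, here is a minimal sketch of what this kind of freezing does and does not change. `TinyMixed` is a hypothetical stand-in built from small linear layers, not the real `CNNLSTM`, and the layer sizes are assumptions chosen only to keep the example short. It shows which parameters stay trainable, and that setting `requires_grad = False` leaves the train/eval flag (and therefore dropout) untouched.

```python
import torch
import torch.nn as nn


class TinyMixed(nn.Module):
    """Hypothetical stand-in for the combined model (not the real CNNLSTM)."""

    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(nn.Linear(8, 4), nn.Dropout(p=0.6))
        self.lstm = nn.Sequential(nn.Linear(8, 4), nn.Dropout(p=0.5))
        self.fc1 = nn.Linear(8, 1)  # trainable head on the concatenated features

    def freeze(self):
        for p in self.cnn.parameters():
            p.requires_grad = False
        for p in self.lstm.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.fc1(torch.cat([self.cnn(x), self.lstm(x)], dim=1))


model = TinyMixed()
model.freeze()

# Only the head should still be trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

# Hand the optimizer only the parameters that still require gradients.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# requires_grad does not change the train/eval flag, so dropout inside the
# frozen submodules stays active until eval() is called on them.
dropout_training_after_freeze = model.cnn[1].training
model.cnn.eval()
model.lstm.eval()
dropout_training_after_eval = model.cnn[1].training
print(dropout_training_after_freeze, dropout_training_after_eval)
```

Note that a later call to `model.train()` on the parent module switches the submodules back to training mode, so the dropout layers inside them become active again.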

Then I trained the combined model and got better results than with either submodule trained and evaluated alone.

I used an early-stopping technique in my epoch loop to save the best parameters.
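
For illustration, early stopping that keeps the best parameters in memory might be sketched as follows. This is an assumed minimal setup, not the actual training loop: the toy data, the linear model, `patience`, and `validation_loss` are all hypothetical placeholders. The point it shows is that `state_dict()` returns tensors sharing storage with the live parameters, so an in-memory "best" snapshot needs a deep copy.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data and model; all hypothetical placeholders.
x_train, y_train = torch.randn(64, 4), torch.randn(64, 1)
x_val, y_val = torch.randn(32, 4), torch.randn(32, 1)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()


def validation_loss(net):
    """Evaluate in eval mode without tracking gradients."""
    net.eval()
    with torch.no_grad():
        return loss_fn(net(x_val), y_val).item()


best_loss, best_state = float("inf"), None
patience, bad_epochs = 3, 0

for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    val_loss = validation_loss(model)
    if val_loss < best_loss:
        best_loss = val_loss
        # state_dict() returns tensors that share storage with the live
        # parameters, so take a deep copy to fix this snapshot in time.
        best_state = copy.deepcopy(model.state_dict())
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

# A fresh instance loaded with the snapshot reproduces the best loss.
restored = nn.Linear(4, 1)
restored.load_state_dict(best_state)
restored_loss = validation_loss(restored)
print(best_loss, restored_loss)
```

If the best parameters are written to disk with `torch.save` at the moment of improvement, the file already holds a copy and the `deepcopy` is not needed; it matters when the snapshot is kept in a Python variable and saved later.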

After training, I created a new instance of the same class, but when I load the saved “best” parameters into it I do not get similar results.

I tried the same thing with each submodule alone (RNN and CNNText here), and it worked. With the combined model, however, the reloaded parameters do not give the same performance.
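
The save-and-reload step described above can be sanity-checked with a sketch like the following. It assumes a small hypothetical model containing dropout (`make_model` is a placeholder, not the `CNNLSTM` class) and uses an in-memory buffer instead of a file. It checks that every key was matched on load and that the original and reloaded instances produce the same output on the same input once both are in eval mode.

```python
import io

import torch
import torch.nn as nn


def make_model():
    """Hypothetical small network with dropout, standing in for the real model."""
    return nn.Sequential(
        nn.Linear(10, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 1)
    )


torch.manual_seed(0)
trained = make_model()

# Save the parameters (an in-memory buffer stands in for a file path).
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Load into a brand-new instance; strict=True is the default, so missing or
# unexpected keys would raise instead of being silently skipped.
restored = make_model()
result = restored.load_state_dict(torch.load(buffer))
print(result.missing_keys, result.unexpected_keys)

# Compare both instances on the same input with dropout disabled.
x = torch.randn(4, 10)
trained.eval()
restored.eval()
with torch.no_grad():
    same_in_eval = torch.allclose(trained(x), restored(x))
print(same_in_eval)
```

If the outputs match in this check but evaluation metrics still differ, the cause is likely outside the saved parameters, for example dropout left active during evaluation or different preprocessing between the two runs.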

A few experiments I tried:

- I loaded the saved weights of each submodule first and then loaded the best parameters; the result was fairly close to the best one.
- I took the hidden layer output from each submodule before dropout is applied; this did better than the previous attempt, but still did not match the best result.

Please help me understand what is happening here. I am new to deep learning concepts.

Thank you.