Save the best model

    model = model.train()
    best_accuracy = 0
    ...
    for epoch in range(100):
        for idx, data in enumerate(data_loader):
            ...
        if cur_accuracy > best_accuracy:
            best_model = model
    torch.save(best_model.state_dict(), 'model.pt')

Will this correctly save the model with the best accuracy?


This code won’t work, as best_model holds a reference to model, which will be updated in each epoch.
You could use copy.deepcopy to create a deep copy of the parameters, or use the save_checkpoint method provided in the ImageNet example.
Here is a small example for demonstrating the issue with your code:

import copy

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(10):
    optimizer.zero_grad()
    output = model(torch.randn(1, 10))
    loss = criterion(output, torch.randn(1, 2))
    loss.backward()
    optimizer.step()
    
    # Save 2nd epoch
    if epoch == 2:
        best_model = model  # Won't work!
        #best_model = copy.deepcopy(model)  # Will work
        
# Compare models
for param1, param2 in zip(best_model.parameters(), model.parameters()):
    print((param1 == param2).all())
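
To make the difference concrete, here is a minimal sketch (not from the original post) that keeps both a plain reference and a deep copy of the weights taken at the same epoch, then continues training. The names alias and snapshot are made up for illustration, and copying the state_dict instead of the whole module is just one possible choice.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

alias = None     # plain reference, like best_model = model
snapshot = None  # independent copy of the weights

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(torch.randn(4, 10)), torch.randn(4, 2))
    loss.backward()
    optimizer.step()
    if epoch == 2:
        alias = model
        snapshot = copy.deepcopy(model.state_dict())

# The alias is the very same object, so it always matches the live model.
same_as_alias = all(
    torch.equal(p, q) for p, q in zip(alias.parameters(), model.parameters())
)
# The snapshot was frozen at epoch 2, so later updates are not reflected in it.
same_as_snapshot = all(
    torch.equal(snapshot[k], v) for k, v in model.state_dict().items()
)
print(same_as_alias, same_as_snapshot)
```

The snapshot can later be restored with model.load_state_dict(snapshot) or written to disk with torch.save.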

Hi, could you show how your code would save the best model, and how to compare the accuracy across epochs?
Thank you very much.

Usually you would calculate the validation error/loss and save the best performing model (i.e. with the highest validation accuracy).
Have a look at the ImageNet example to see how save_checkpoint is used to keep the model with the best accuracy.
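
As a rough sketch of that pattern (loosely modeled on the idea of a save_checkpoint helper, not copied from the ImageNet example): the latest checkpoint is always written, and it is copied to a separate file whenever the validation accuracy improves. The directory argument, the file names, and the fake_val_acc list are assumptions made for illustration only.

```python
import os
import shutil
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(state, is_best, directory, filename="checkpoint.pth.tar"):
    """Always write the latest checkpoint; copy it aside when it is the best so far."""
    path = os.path.join(directory, filename)
    torch.save(state, path)
    if is_best:
        shutil.copyfile(path, os.path.join(directory, "model_best.pth.tar"))


model = nn.Linear(10, 2)
out_dir = tempfile.mkdtemp()
best_acc = 0.0

# Placeholder validation accuracies; in practice compute one value per epoch.
fake_val_acc = [0.50, 0.70, 0.65, 0.80, 0.75]

for epoch, val_acc in enumerate(fake_val_acc):
    # ... train for one epoch, then evaluate on the validation set ...
    is_best = val_acc > best_acc
    best_acc = max(val_acc, best_acc)
    save_checkpoint(
        {"epoch": epoch, "state_dict": model.state_dict(), "best_acc": best_acc},
        is_best,
        out_dir,
    )

best = torch.load(os.path.join(out_dir, "model_best.pth.tar"))
print(best["epoch"], best["best_acc"])
```

Because the weights are written to disk at the moment they are best, later training steps cannot change them, which sidesteps the reference problem entirely.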


OK, thank you very much.

Hi, thanks for the great answer.
I would like to know how to use the “compare models” part of the code. Is it used for choosing the best trained model, or does it just check whether the two models are identical?
Thanks!

My code snippet was just showing that the original code is not working, as no deepcopy was performed.
I would recommend sticking to the linked ImageNet example.

Honestly, this kind of stuff should be mentioned in the docs.
The current example of storing the best model in the docs leads to exactly this kind of bug, and it has already happened: we have been using an overfitted model in prod for months :frowning:

I’m sorry to hear you’ve had this trouble. :confused:
Would you be interested in adding this use case into the docs?

Yep. Created a merge request.

Here’s what works for me:

    model = model.train()
    best_accuracy = 0
    ...
    for epoch in range(100):
        for idx, data in enumerate(data_loader):
            ...
        if cur_accuracy > best_accuracy:
            torch.save(model.state_dict(), 'best_model.pt')

Thanks for your code snippet. Do you continue training from this best model, or do you just save it for use at the end? What are the downsides of doing the former? @alx

Yes, it continues to the next epoch until it hits a better accuracy. If you use the same filename (here ‘best_model.pt’), it will replace the prior one. If you add a prefix, e.g. ‘epoch_5_best_model.pt’, you will end up with a list of .pt files at the end of your run.

Personally I prefer keeping them all in case of overfitting.

@alx Thanks for sharing the code snippet. I think best_accuracy has to be updated with cur_accuracy inside the if condition.
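
The effect of that missing update can be shown without any training at all. The following is only an illustrative sketch: epochs_that_save is a made-up helper and the accuracy values are invented. It records at which epochs the if branch, and therefore torch.save, would run.

```python
def epochs_that_save(accuracies, update_best):
    """Return the epochs at which the `if` branch (and thus torch.save) would run."""
    best_accuracy = 0
    saved = []
    for epoch, cur_accuracy in enumerate(accuracies):
        if cur_accuracy > best_accuracy:
            if update_best:
                best_accuracy = cur_accuracy
            saved.append(epoch)
    return saved


accuracies = [0.60, 0.72, 0.68, 0.75, 0.70]  # made-up per-epoch values
without_update = epochs_that_save(accuracies, update_best=False)
with_update = epochs_that_save(accuracies, update_best=True)
print(without_update)
print(with_update)
```

Without the update, every epoch with a nonzero accuracy overwrites the file, so it ends up holding the last epoch rather than the best one. With the update, only the epochs that actually improve on the previous best trigger a save.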

For a regression problem that uses MSELoss as the loss criterion, do you suggest saving the model with the lowest loss as the best model? What is the best practice for saving the best model in a regression problem?

Is this correct for a binary classification problem?

            if val_acc > best_pred: 
                best_pred = val_acc
                best_epoch = epoch
                best_preds = val_preds
                best_val_labels = val_labels
                if not test:
                    print("saving model...")
                    torch.save(model.state_dict(), model_path + task_name + ".pth")

I have this for a regression problem. I am not sure why the wrong epoch is chosen for best_epoch for saving the model.

Epoch 019: | Train Loss: 0.02398 | Val Loss: 0.01437
*********************************************
    epochs variable     value
0        0    train  7.781681
1        1    train  0.099485
2        2    train  0.042489
3        3    train  0.032492
4        4    train  0.027184
5        5    train  0.027754
6        6    train  0.028582
7        7    train  0.027290
8        8    train  0.025891
9        9    train  0.027295
10      10    train  0.024953
11      11    train  0.026291
12      12    train  0.024126
13      13    train  0.024956
14      14    train  0.027561
15      15    train  0.024773
16      16    train  0.027174
17      17    train  0.024941
18      18    train  0.028399
19      19    train  0.023982
20       0      val  0.052594
21       1      val  0.020137
22       2      val  0.017723
23       3      val  0.015413
24       4      val  0.018655
25       5      val  0.016836
26       6      val  0.015303
27       7      val  0.018473
28       8      val  0.015580
29       9      val  0.015238
30      10      val  0.016674
31      11      val  0.017407
32      12      val  0.019380
33      13      val  0.017760
34      14      val  0.015137
35      15      val  0.015405
36      16      val  0.021744
37      17      val  0.019030
38      18      val  0.015155
39      19      val  0.014370
best_epoch:  16
best_val_loss:  tensor(0.0041, device='cuda:0')

and this code in order to find the best model (not sure if the logic makes sense for regression):

            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_epoch = epoch
                print('best epoch is: ', best_epoch)
                best_preds = val_preds
                best_val_target = val_targets
            # if val_acc > best_pred: 
            #     best_pred = val_acc
            #     best_epoch = epoch
            #     best_preds = val_preds
            #     best_val_labels = val_labels

                
                if not test:
                    print("saving model...")
                    torch.save(model.state_dict(), model_path + task_name + ".pth")

also

best_val_loss = 100.0
best_epoch = 0

Actually, I found my mistake, but I still don’t know whether my method for choosing best_epoch and saving the best model makes sense for a regression problem:

Epoch 019: | Train Loss: 0.02677 | Val Loss: 0.01631
*********************************************
    epochs variable     value
0        0    train  7.661003
1        1    train  0.086944
2        2    train  0.038162
3        3    train  0.030772
4        4    train  0.024799
5        5    train  0.026888
6        6    train  0.024464
7        7    train  0.025889
8        8    train  0.025391
9        9    train  0.026814
10      10    train  0.025069
11      11    train  0.023473
12      12    train  0.022358
13      13    train  0.022658
14      14    train  0.023542
15      15    train  0.023952
16      16    train  0.027181
17      17    train  0.024417
18      18    train  0.025738
19      19    train  0.026772
20       0      val  0.108998
21       1      val  0.019741
22       2      val  0.027762
23       3      val  0.015706
24       4      val  0.017672
25       5      val  0.028519
26       6      val  0.021219
27       7      val  0.015313
28       8      val  0.017817
29       9      val  0.015510
30      10      val  0.016199
31      11      val  0.014752
32      12      val  0.017109
33      13      val  0.015249
34      14      val  0.015251
35      15      val  0.027797
36      16      val  0.019257
37      17      val  0.021981
38      18      val  0.015022
39      19      val  0.016308
best_epoch:  11
best_val_loss:  0.014752049872186035

and

            val_loss = val_epoch_loss/len(dataloader_val)
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_epoch = epoch
                print('best epoch is: ', best_epoch)
                best_preds = val_preds
                best_val_target = val_targets
            # if val_acc > best_pred: 
            #     best_pred = val_acc
            #     best_epoch = epoch
            #     best_preds = val_preds
            #     best_val_labels = val_labels

                
                if not test:
                    print("saving model...")
                    torch.save(model.state_dict(), model_path + task_name + ".pth")

I fixed the problem by adding this line:
val_loss = val_epoch_loss/len(dataloader_val)
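
For readers following along, a minimal sketch of what that averaging step looks like in a complete validation pass. The model and data here are toy placeholders, and accumulating with .item() is an assumption about how val_epoch_loss is built, since the original accumulation code is not shown.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
model = nn.Linear(5, 1)
criterion = nn.MSELoss()
dataloader_val = DataLoader(
    TensorDataset(torch.randn(32, 5), torch.randn(32, 1)), batch_size=8
)

model.eval()
val_epoch_loss = 0.0
with torch.no_grad():
    for inputs, targets in dataloader_val:
        # .item() turns the 0-dim loss tensor into a plain Python float
        val_epoch_loss += criterion(model(inputs), targets).item()

# Average over batches so the value is comparable from epoch to epoch.
val_loss = val_epoch_loss / len(dataloader_val)
print(val_loss)
```

Computing this once per epoch gives a single Python float that can be compared directly against best_val_loss.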

Saving the “best” checkpoint is usually done by checking the validation metrics and selecting the model state with the highest validation metric or the lowest validation loss. This does not depend on the actual use case (e.g. binary classification, regression, etc.).
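
Since the only difference between the two cases is the direction of the comparison, both can be handled by one small helper. This is a hedged sketch: BestMetric, its mode argument, and the loss values are hypothetical and not part of PyTorch.

```python
import math


class BestMetric:
    """Track the best value of a validation metric (hypothetical helper)."""

    def __init__(self, mode="min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.mode = mode
        # Start from the worst possible value so the first epoch always counts.
        self.best = math.inf if mode == "min" else -math.inf
        self.best_epoch = None

    def update(self, value, epoch):
        """Return True if `value` improves on the best seen so far."""
        value = float(value)  # also accepts a 0-dim tensor
        improved = value < self.best if self.mode == "min" else value > self.best
        if improved:
            self.best, self.best_epoch = value, epoch
        return improved


val_losses = [0.109, 0.020, 0.028, 0.016, 0.018]  # illustrative per-epoch averages
tracker = BestMetric(mode="min")
for epoch, loss in enumerate(val_losses):
    if tracker.update(loss, epoch):
        pass  # this is where torch.save(model.state_dict(), path) would go
print(tracker.best_epoch, tracker.best)
```

Use mode="min" for a loss and mode="max" for an accuracy. Initializing with infinity avoids having to guess a starting value such as 100.0.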

I don’t know how the output is created, as it seems you are running multiple epochs for the validation, while you would calculate the validation accuracy/loss once per epoch.


I used the same method for saving the best model, but I have a weird issue!
When I load the saved model for later prediction (inference), the model output is effectively random. I manually checked the weights of best_model and loaded_model and everything seems OK, but the predictions from loaded_model are totally incorrect, while the model in the Jupyter Notebook works fine.

This is my model architecture:


class Char2Vec(nn.Module):
    def __init__(self, vocab_size, embed_dim, out_ch1= CFG.out_ch1, out_ch2= CFG.out_ch2):
        super().__init__()
        self.out_ch1, self.out_ch2 = out_ch1, out_ch2
        self.embeds = nn.Embedding(vocab_size, embed_dim, padding_idx=0) # first embedding layer for characters
        self.conv1 = nn.Sequential(
            nn.Conv1d(in_channels=embed_dim, out_channels=out_ch1, kernel_size=3),
            nn.ReLU(),
            nn.Dropout(.1),
        )
        self.convs2 = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Conv1d(out_ch1, out_ch2//3, kernel_size=k),
                    nn.ReLU(),
                )
                for k in [3, 4, 5]
            ]
        )
        self.linear = nn.Sequential(
            nn.Linear(out_ch2, out_ch2),
            nn.ReLU(),
        )

    def forward(self, word):
        embeds = self.embeds(word).transpose(-2,-1)
        batch, sent, emb, seq = embeds.shape
        conv1 = self.conv1(embeds.view(-1, emb, seq))
        tmp = [cnn(conv1).max(dim=-1)[0].squeeze() for cnn in self.convs2]
        conv2 = torch.cat(tmp, dim=1)
        lin = self.linear(conv2)
        return (lin+conv2).view(batch, sent, -1)

class BiLSTMtagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, tagset_size):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = Char2Vec(Data.char_vocab_size, Data.d)
        self.lstm = nn.LSTM(
            input_size  = embedding_dim,
            hidden_size = hidden_dim,
            num_layers  = 2,
            batch_first = True,
            bidirectional = True,
            dropout     = 0.3
        )
        self.hidden2tag = nn.Linear(hidden_dim*2, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds)
        tag_space = self.hidden2tag(lstm_out)
        return F.log_softmax(tag_space, dim=1)

I save it like the following:

torch.save(best_model.state_dict(), 'bestmodel.pt')

and then for prediction, I do something like the below:

model = BiLSTMtagger(EMBEDDING_DIM, HIDDEN_DIM, TAGSET_SIZE)
state = torch.load('bestmodel.pt')
model.load_state_dict(state)
model.eval()
out = model(x).argmax(dim=-1)[0].tolist()
print(out)

I am so confused, and I’ve struggled with it the whole day without any success. I appreciate any help.

Could you describe what exactly you are comparing?
Are you using exactly the same inputs and are you calling model.eval() in both cases?
If not, make sure the test in the training script (before saving the state_dict) and the one in your inference script are as equal as possible and check if the error is still visible.
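
One way to run such a check is a save/load round trip on a fixed input. The sketch below uses a small stand-in network with dropout rather than the BiLSTMtagger from the post; make_model and the temporary file path are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Stand-in architecture with dropout so that train/eval mode matters.
    return nn.Sequential(
        nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 3)
    )


torch.manual_seed(0)
trained = make_model()
x = torch.randn(4, 8)  # fixed input reused for both checks

# Reference output, computed in eval mode right before saving.
trained.eval()
with torch.no_grad():
    expected = trained(x)

path = os.path.join(tempfile.mkdtemp(), "bestmodel.pt")
torch.save(trained.state_dict(), path)

# Inference side: fresh instance, load the weights, switch to eval mode.
loaded = make_model()
loaded.load_state_dict(torch.load(path))
loaded.eval()
with torch.no_grad():
    actual = loaded(x)

print(torch.allclose(expected, actual))
```

If the two outputs match on the same tensor, the saved weights are fine and the difference is more likely to come from the input pipeline (e.g. a vocabulary or preprocessing mismatch) than from saving and loading.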