How can I save best model weights to continue training the model after stopping because of limited GPU resources?
The ImageNet example would be a good reference for resuming the training.
E.g. there a checkpoint is loaded so that the training can be resumed, and the checkpoint giving the best validation accuracy is stored.
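In short, the pattern looks like this (a minimal sketch; train_one_epoch and validate are hypothetical helpers, and the model/optimizer/loader names are assumed, not taken from the example):

best_acc = 0.0
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_loader)  # hypothetical training helper
    acc = validate(model, val_loader)                # hypothetical validation helper
    if acc > best_acc:                               # keep the best-performing checkpoint
        best_acc = acc
        torch.save({'epoch': epoch,
                    'state_dict': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'best_acc': best_acc}, 'model_best.pth')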
I am working with a GAN model, so I don’t calculate accuracy. Should I save the best validation G_loss and D_loss instead?
I am using the code below during training; is it correct?
if min_valid_loss_g > valid_loss_g:
    print(f'G_Val_Loss_Decreased({min_valid_loss_g:.6f}--->{valid_loss_g:.6f})\t Saving The Model')
    min_valid_loss_g = valid_loss_g
    torch.save({
        'epoch': epoch,
        'G_state_dict': G.state_dict(),
        'G_optimizer_state_dict': optimizer_G.state_dict(),
        'G_loss': valid_loss_g
    }, f"./generator-epoch-{epoch}.pth")
Yes, your approach sounds reasonable assuming the validation loss properly represents the training progress of your model.
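If you also want to resume the adversarial training itself, it is common to store both networks and both optimizers in a single checkpoint. A hedged sketch, assuming your discriminator, its optimizer, and its validation loss are named D, optimizer_D, and valid_loss_d:

torch.save({
    'epoch': epoch,
    'G_state_dict': G.state_dict(),
    'D_state_dict': D.state_dict(),                      # assumed discriminator
    'G_optimizer_state_dict': optimizer_G.state_dict(),
    'D_optimizer_state_dict': optimizer_D.state_dict(),  # assumed discriminator optimizer
    'G_loss': valid_loss_g,
    'D_loss': valid_loss_d                               # assumed discriminator validation loss
}, f"./gan-epoch-{epoch}.pth")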
How can I check whether the validation loss properly represents the training progress of my model?
The code below shows how I am loading the saved models to continue training after restarting the kernel. Is it correct?
G = Generator().to(device)
optimizer_G = torch.optim.Adam(G.parameters())  # the optimizer must exist before loading its state (Adam assumed)

checkpoint = torch.load('generator.pth')
G.load_state_dict(checkpoint['G_state_dict'])
optimizer_G.load_state_dict(checkpoint['G_optimizer_state_dict'])  # key must match the one used in torch.save
epoch = checkpoint['epoch']
loss = checkpoint['G_loss']  # key must match the one used in torch.save
G.train()  # use train() to continue training; eval() is for inference
Should I continue training on the same training dataset, or should I modify it?
The choice of the dataset depends on your use case. If you want to “continue” the training, then using the same dataset would work; if you want to fine-tune the model on another dataset then you would need to change it.
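If you do continue on the same data, remember to resume the loop from the epoch after the one stored in the checkpoint, so no epoch is repeated. A small sketch, assuming a num_epochs variable from your training setup:

start_epoch = checkpoint['epoch'] + 1  # continue right after the saved epoch
for epoch in range(start_epoch, num_epochs):  # num_epochs assumed
    # run the usual training and validation steps here
    ...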
And how do I load the saved models?
Using this technique, I got the same loss for different epochs during training.
Where is the problem?
In your code snippet you are already loading the state_dicts in:
G.load_state_dict(checkpoint['G_state_dict'])
optimizer_G.load_state_dict(checkpoint['G_optimizer_state_dict'])
Check that gradients are calculated for each used parameter after the first backward pass. If some .grad attributes are set to None, your computation graph is detached. If that’s not the case, try to overfit a small dataset by playing around with hyperparameters.
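For the overfitting check, a hedged sketch assuming your training set is called train_dataset:

# a healthy setup should drive the loss toward zero on a tiny subset
small_set = torch.utils.data.Subset(train_dataset, range(16))  # train_dataset assumed
small_loader = torch.utils.data.DataLoader(small_set, batch_size=4, shuffle=True)
# then run your usual training loop on small_loader and watch the loss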
How can I check those gradients?
You could iterate the parameters and print their .grad attribute:
loss.backward()
for name, param in model.named_parameters():
    print('{}, {}'.format(name, param.grad))
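Printing full gradient tensors can get noisy for larger models; a variant of the same check that only flags parameters whose gradients are missing:

loss.backward()
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is None:
        print(f'{name} has no gradient -> check for a detached graph')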