How to resume a saved trained model at a specific epoch

I saved the model after 150 epochs like this: torch.save(model.state_dict(), 'train_valid_exp4.pth')

I can load the model and test it with model.load_state_dict(torch.load('train_valid_exp4.pth')), which I assume gives me the model as it was at the last epoch.
My model seems to perform better at epoch 40, so the question is: how can I resume the model at epoch 40?


The history of past epochs is not saved. If you need to go back to epoch 40, you have to have saved the model at epoch 40.

In addition to the model parameters, you should also save the state of the optimizer, because the optimizer's internal state (for example, Adam's running averages) also changes as training proceeds.
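
To make the point about optimizer state concrete, here is a minimal sketch of bundling the model and optimizer state into one checkpoint. The tiny nn.Linear model, the file name checkpoint_demo.pth and the dictionary keys are assumptions chosen for illustration; they are not taken from the original training script.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Tiny stand-in model and optimizer (assumptions for illustration only).
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# One training step so that Adam has internal state worth saving.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

# Bundle everything needed to resume into a single dictionary.
checkpoint = {
    'epoch': 1,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}
path = os.path.join(tempfile.mkdtemp(), 'checkpoint_demo.pth')  # hypothetical file name
torch.save(checkpoint, path)

# Reading the file back gives the same dictionary structure.
saved_keys = sorted(torch.load(path).keys())
```

The key names are free to choose, as long as the loading code uses the same ones.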


@vmirly1 Thank you. How can I save the model after each epoch?

Here is the snippet, where I put the save at the end.


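import time

import numpy
import torch
import torch.nn as nn
import torch.optim as optim

# model, device and dataLoaders are assumed to be defined elsewhere
criterion = nn.NLLLoss()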
#optimizer = optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.999), eps=1e-08)  # 1e-3
optimizer = optim.Adam(model.parameters(), lr=0.00001)  # 1e-3



def train_valid_model():
    
    num_epochs=150
                      
    since = time.time()
    out_loss = open("history_loss_exp5.txt", "w")
    out_acc = open("history_acc_exp5.txt", "w")

    losses=[]
    ACCes =[]
    #losses = {}

       
    for epoch in range(num_epochs):  # loop over the dataset multiple times
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 30)
        
        # Each epoch has a training and validation phase
        for phase in ['train', 'valid']:
            if phase == 'train':
                
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode
        
            train_loss = 0.0
            total_train = 0
            correct_train = 0

            #iterate over data
            for t_image, mask, image_paths, target_paths in dataLoaders[phase]:
                
                 
                # get the inputs
                t_image = t_image.to(device)
                mask = mask.to(device)
                                
                # zero the gradient buffers of all parameters
                optimizer.zero_grad()
                
                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(t_image) 
                    _, predicted = torch.max(outputs.data, 1)
                    loss = criterion(outputs, mask) # calculate the loss
            
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward() # back propagation
                        optimizer.step()  # update the parameters
            
                # accuracy
                train_loss += loss.item()
                total_train += mask.nelement()  # number of pixels in the batch
                correct_train += predicted.eq(mask.data).sum().item()  # count correctly predicted pixels
                
            epoch_loss = train_loss / len(dataLoaders[phase].dataset)
            #losses[phase] = epoch_loss
            losses.append(epoch_loss)
                            
            epoch_acc = 100 * correct_train / total_train
            ACCes.append(epoch_acc)
                                             
            print('{} Loss: {:.4f} {} Acc: {:.4f}'.format(phase, epoch_loss, phase, epoch_acc))     

            out_loss.write('{} {} Loss: {:.4f}\n'.format(epoch, phase, epoch_loss))
            out_acc.write('{} {} ACC: {:.4f}\n'.format(epoch, phase, epoch_acc))

            #numpy.savetxt('loss.csv', (losses), "%.4f", header= 'loss', comments='', delimiter = ",")
            #numpy.savetxt('ACC.csv', (ACCes), "%.4f", header= 'accuracy', comments='', delimiter = ",")                    
            numpy.savetxt('loss_acc_exp5.csv', numpy.c_[losses, ACCes], fmt=['%.4f', '%.4f'], header="loss,acc", comments='', delimiter=",")
        
            
            
    print()
    time_elapsed = time.time() - since
    
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    
    torch.save(model.state_dict(), 'train_valid_exp4.pth')   

You just need to move the torch.save line above inside the for-loop (for epoch in ...) and give it a filename that includes the epoch number:

      # inside the for-loop:
      torch.save(model.state_dict(), 'train_valid_exp4-epoch{}.pth'.format(epoch)) 

You may also consider saving every 10 epochs instead of every epoch, if the checkpoints take too much space:

      # inside the for-loop:
      if epoch % 10 == 9:
          torch.save(model.state_dict(), 'train_valid_exp4-epoch{}.pth'.format(epoch+1)) 
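
As a self-contained sketch of the periodic saving described above, the following loop writes one file every 10 epochs and then lists what was written. The stand-in nn.Linear model, the temporary directory and the shortened num_epochs are assumptions for illustration; the training and validation code is left out.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)        # stand-in model (assumption)
save_dir = tempfile.mkdtemp()  # stand-in output directory (assumption)
num_epochs = 30                # shortened for the example
save_every = 10

for epoch in range(num_epochs):
    # ... training and validation for this epoch would go here ...

    # epoch is 0-based, so epoch + 1 is the number of completed epochs.
    if (epoch + 1) % save_every == 0:
        name = 'train_valid_exp4-epoch{}.pth'.format(epoch + 1)
        torch.save(model.state_dict(), os.path.join(save_dir, name))

saved_files = sorted(os.listdir(save_dir))
```

With these settings the directory ends up with three files, for epochs 10, 20 and 30.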

@vmirly1 Thanks a lot, very helpful. This means I have to run two models again :frowning_face:

Oh bummer! Hope it won’t take too long to rerun them. Also, you can consider saving the state of the optimizer. A good example is here: Saving and loading a model in Pytorch?


@vmirly1 Thank you for sharing this. How about the optimizer? I assume it should be saved the same way as the model, inside the loop. Is that right?

Yes, this link (Saving and loading a model in Pytorch?) has an example that includes the optimizer as well. Basically, you create a dictionary and save the checkpoint as follows:

        save_checkpoint({
            'epoch': epoch + 1,
            'arch': args.arch,
            'state_dict': model.state_dict(),
            'best_prec1': best_prec1,
            'optimizer' : optimizer.state_dict(),
        }, is_best)

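Note that I just copied this from that link, so names such as save_checkpoint, args, best_prec1 and is_best come from that example and are not defined in your snippet.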


@vmirly1 I just forgot to ask about loading. I can see that the example here loads the epoch like this:

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

Could you please give me an example of how I can load the model at a specific epoch? For example, the checkpoint saved after the first 10 epochs.


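Sure! A specific model and all its parameters, including the optimizer state, will all be together in the same file, so you just need to load the corresponding file. This assumes each checkpoint was saved as a dictionary with the keys 'model_state_dict', 'optimizer_state_dict', 'epoch' and 'loss', under a filename that contains the epoch number as above. So, if we set epoch = 10, loading looks like this: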

epoch = 10
PATH = 'train_valid_exp4-epoch{}.pth'.format(epoch)
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
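
For completeness, here is a minimal round-trip sketch showing both the saving and the loading side with matching keys, and how the restored epoch can be used to continue the loop. The helper make_model_and_optimizer, the stand-in nn.Linear model, the temporary directory and the placeholder loss value are all assumptions for illustration, not part of the original script.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

save_dir = tempfile.mkdtemp()  # stand-in output directory (assumption)


def make_model_and_optimizer():
    """Hypothetical helper: build a fresh stand-in model and optimizer."""
    model = nn.Linear(4, 2)
    optimizer = optim.Adam(model.parameters(), lr=1e-5)
    return model, optimizer


# Saving side: the keys are chosen to match what the loader reads below.
model, optimizer = make_model_and_optimizer()
epoch = 10  # number of completed epochs
PATH = os.path.join(save_dir, 'train_valid_exp4-epoch{}.pth'.format(epoch))
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': 0.25,  # placeholder value, not a real result
}, PATH)

# Loading side: rebuild the objects first, then restore their state.
new_model, new_optimizer = make_model_and_optimizer()
checkpoint = torch.load(PATH)
new_model.load_state_dict(checkpoint['model_state_dict'])
new_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']

# To resume, loop over range(start_epoch, num_epochs) instead of range(num_epochs).
num_epochs = 150
remaining_epochs = list(range(start_epoch, num_epochs))
weights_match = torch.equal(new_model.weight, model.weight)
```

Call new_model.train() before continuing training, or new_model.eval() if the checkpoint is only used for testing.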

@vmirly1 Thanks a lot.

Thanks for the answers. May I just clarify?

This function is for saving my model:

import shutil
import torch

def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    torch.save(state, filename)
    if is_best:
        # keep a separate copy of the best checkpoint so far
        shutil.copyfile(filename, 'model_best.pth.tar')

In each epoch (inside the for loop) I save my model, optimizer and scheduler:

save_checkpoint({
  'epoch': epoch + 1,
  'arch': args.arch,
  'state_dict': model.state_dict(),
  'optimizer' : optimizer.state_dict(),
  'scheduler': scheduler,
}, is_best)

And I load it like this:

PATH = 'checkpoint.pth.tar'
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

Am I right? Thanks!
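
Two things are worth checking in the snippet above. The keys read when loading ('model_state_dict', 'optimizer_state_dict', 'loss') do not match the keys written when saving ('state_dict', 'optimizer', and no 'loss' entry), so those lookups would raise a KeyError. Also, the scheduler is stored as an object rather than as its state_dict(). The following is a minimal sketch of one consistent way to do it, assuming a StepLR scheduler and a stand-in nn.Linear model; the build helper and the temporary directory are hypothetical and only there to keep the example self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR


def build():
    """Hypothetical helper: fresh stand-in model, optimizer and scheduler."""
    model = nn.Linear(4, 2)
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = StepLR(optimizer, step_size=1, gamma=0.5)
    return model, optimizer, scheduler


model, optimizer, scheduler = build()
for _ in range(2):    # two dummy "epochs" without real training
    optimizer.step()  # no gradients yet, so the parameters stay unchanged
    scheduler.step()  # halves the learning rate each time

# Save state_dict() for all three objects, under the keys used when saving above.
path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pth.tar')
torch.save({
    'epoch': 2,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),
}, path)

# Load with exactly the same keys.
new_model, new_optimizer, new_scheduler = build()
checkpoint = torch.load(path)
new_model.load_state_dict(checkpoint['state_dict'])
new_optimizer.load_state_dict(checkpoint['optimizer'])
new_scheduler.load_state_dict(checkpoint['scheduler'])
restored_epoch = checkpoint['epoch']
restored_lr = new_optimizer.param_groups[0]['lr']
```

If the loss is needed after loading, it has to be added to the saved dictionary as well, for example as a plain Python float.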
