If I load from a checkpoint, does it iterate again over all data?

If so, that would be bad, wouldn’t it? I mean it has already trained on the data :thinking:

Thanks for your help!

If your training is set up as for epoch in range(end_epoch), then yes, it will iterate over all the data and retrain on it again. If you’re using this to resume training, just include a start_epoch value and use range(start_epoch, end_epoch). If you only want to evaluate, you can either comment out the training entirely, set the end epoch to 0 (range(0, 0) is empty, so there will be no iterations), or add a branching if statement that sets whether you’re doing evaluation or training.
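For illustration, a minimal sketch of that resume pattern (the checkpoint file name and the keys 'epoch', 'state_dict' and 'optimizer' are assumptions here, and model, optimizer and end_epoch are assumed to exist already):

import os
import torch

start_epoch = 0
if os.path.exists('checkpoint.pt'):                 # assumed file name
    ckpt = torch.load('checkpoint.pt')
    model.load_state_dict(ckpt['state_dict'])
    optimizer.load_state_dict(ckpt['optimizer'])
    start_epoch = ckpt['epoch'] + 1                 # continue after the last completed epoch

for epoch in range(start_epoch, end_epoch):         # already-trained epochs are skipped
    # training / evaluation for one epoch goes here
    pass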


@cpeters Great, thank you so much!
I am having difficulties applying it to my scenario. I only have two epochs, but each contains 146,447 batches, so I would need to do this at the batch level rather than the epoch level. I have train() and evaluate() functions that are called like this:

for epoch in range(params['epochs']):

    print('\n Epoch {:} / {:}'.format(epoch + 1, params['epochs']))

    # train model
    train_loss = train(scheduler, optimizer)

    # evaluate model
    valid_loss = evaluate()

    # save the best model -> OLD
    # this only saves the model at the end of each epoch. I want to change it.
    # if valid_loss < best_valid_loss:
    #     best_valid_loss = valid_loss
    #     torch.save(model.state_dict(), model_file)

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')

I could create the checkpoint within train() but then won’t have access to valid_loss within train().

Any idea how I could resolve this? :sweat_smile:

I used to create the checkpoint within train() like this:

# checkpoint
if step % 300 == 0 and step != 0:
    checkpoint = {'state_dict': model.state_dict(),
                  'optimizer': optimizer.state_dict(),
                  'train_loss': step_loss,
                  #'val_Loss': valid_loss,
                  }
    save_checkpoint(checkpoint)

It saves the stated parameters every 300 steps. This works, but if I load from it, training starts from the beginning of the dataset again. So I would need to create something like range(start_batch, end_batch)? Also, I do not have access to the validation loss within train(), as mentioned before.
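Loading it back looks roughly like this (just a sketch, assuming save_checkpoint writes the dict above to a single checkpoint.pt file):

ckpt = torch.load('checkpoint.pt')              # assumed file name
model.load_state_dict(ckpt['state_dict'])
optimizer.load_state_dict(ckpt['optimizer'])
# nothing in the checkpoint records which batch training stopped at,
# so the next run starts again at the first batch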

If you want to resume from partway through a dataset (and I’m assuming your dataset isn’t being shuffled, or it’s meaningless to even try), then you just need to specify a starting point. For example, have train(scheduler, optimizer, starting_step), initialise starting_step to whatever you want outside of the epoch loop, and reset it to 0 at the end of the epoch. It’s a bit odd to pause and resume mid-epoch, so the easiest way I can see is something like

if step <= starting_step:
    continue
# training code for this batch goes here

If you forgot to reset it to 0 at the end of the epoch, every later epoch would also skip those first batches, and you’d only be training on a fixed subset of the data.

As for accessing the validation loss from within another method, you could either once again pass it as an argument to train, or you could use a global variable (the same goes for starting_step).
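A rough sketch of how that could fit together (names like train_dataloader and resume_step are placeholders, not your actual variables, and starting_step here is the index of the first batch that still needs training):

def train(scheduler, optimizer, starting_step=0):
    for step, batch in enumerate(train_dataloader):          # assumed dataloader name
        if step < starting_step:
            continue                                          # draw and discard batches already trained on
        # ... usual forward / backward / optimizer.step() for this batch ...
        if step % 300 == 0 and step != 0:
            save_checkpoint({'state_dict': model.state_dict(),
                             'optimizer': optimizer.state_dict(),
                             'step': step})                   # record the step so you can resume mid-epoch
    # return the accumulated training loss as before

starting_step = resume_step                                   # e.g. checkpoint['step'] + 1 when resuming, else 0
for epoch in range(params['epochs']):
    train_loss = train(scheduler, optimizer, starting_step)
    starting_step = 0                                         # reset so later epochs see the full dataset
    valid_loss = evaluate()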


okay thanks, I’ll try to do so!

Yes, it’s weird indeed. I am using Google Colab and the runtime is interrupted after around 12 hours, which is in the middle of an epoch. Thus, I need to restart it… preferably from a checkpoint at the point where it was interrupted.

To be more specific, that check would have to come inside the loop over the dataset, i.e. each skipped batch is still drawn and then discarded. Getting through the skipped batches will take a little while because the data still has to be loaded, but it’s obviously much, much faster than training on them.
