Hey there! Sorry if this has been asked before (I did several searches and couldn’t find the answers I’m looking for). I have a training loop where I compile a model before training using the default parameters (this model has pre-trained weights already loaded):
```python
model = torch.compile(model)
```
In this training loop, I check for the training loss (training loss because this is a specific case where we do not have validation files) and do an early stopping based on the training loss (i.e., if the absolute change of the loss is below a certain threshold for a number of steps, then we finish training and output the model). Based on a flag, we return either the last trained model, or the model with the lowest training loss:
```python
# this is done at each epoch in the training loop
if tr_loss_sum < min_loss:
    min_loss = copy.copy(tr_loss_sum)
    best_model = copy.deepcopy(model)
    logger.debug(f"epoch: {epoch}, min_loss: {min_loss}")

# early stopping
if early_stopping and np.abs(tr_loss_sum - last_loss) < tolerance:
    stopping_counter += 1
    if stopping_counter >= patience:
        logger.debug(f"Performing early stopping at epoch {epoch}")
        torch.cuda.empty_cache()
        if return_best_model:
            return best_model
        else:
            return model
```
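In case it helps to make the question concrete, here is the kind of workaround I have been sketching: instead of deep-copying the compiled wrapper, snapshot only the underlying module's weights and restore them into a fresh (uncompiled) copy. This is only a sketch, under the assumption that `torch.compile` returns an `OptimizedModule` whose original module is reachable via the internal `_orig_mod` attribute; the tiny `nn.Linear` is just a stand-in for my real model, and the `getattr` fallback lets the snippet run even without a compiler backend:

```python
import copy

import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real pre-trained one.
net = nn.Linear(4, 2)

# In the real loop this would be: compiled = torch.compile(net).
# On torch >= 2.0 the wrapper keeps the original module at `_orig_mod`
# (an internal attribute); fall back to the module itself so this
# sketch also runs where torch.compile isn't usable.
compiled = net
orig = getattr(compiled, "_orig_mod", compiled)

# Snapshot only the weights, not the compiled wrapper or its cache.
best_state = copy.deepcopy(orig.state_dict())

# Later, restore the snapshot into a fresh, uncompiled copy.
best_model = copy.deepcopy(orig)
best_model.load_state_dict(best_state)
```

If something like this is sound, I could keep only `best_state` around during training and re-run `torch.compile` on the restored module at the end, rather than trying to copy the compiled object itself.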
The question I have is as follows: I currently do a `deepcopy` of the model. I know this works when the model is not compiled, so in that situation we do output the best model. But when the model is compiled, my understanding is that even `deepcopy` would fail, because the model object itself points at compiled code in a cache. I understand that compiled models can't be saved as compiled models (PT 2.0 - Are compiled models savable), and that, as of July 2023, compiled models couldn't be cloned/copied (Clone/Copy compiled model). Is there a way to make sure I output the compiled model with the lowest training loss? Would this entail making copies of the cache? Thank you so much for your time; again, I apologize if this question has been answered anywhere else, but I couldn't find anything on the forum or in the documentation.