Hey there! Sorry if this has been asked before (I did several searches and couldn’t find the answers I’m looking for). I have a training loop where I compile a model before training using the default parameters (this model has pre-trained weights already loaded):
```python
model = torch.compile(model)
```
In this training loop, I check for the training loss (training loss because this is a specific case where we do not have validation files) and do an early stopping based on the training loss (i.e., if the absolute change of the loss is below a certain threshold for a number of steps, then we finish training and output the model). Based on a flag, we return either the last trained model, or the model with the lowest training loss:
```python
# this is done at each epoch in the training loop
if tr_loss_sum < min_loss:
    min_loss = copy.copy(tr_loss_sum)
    best_model = copy.deepcopy(model)
    logger.debug(f"epoch: {epoch}, min_loss: {min_loss}")

# early stopping
if early_stopping and np.abs(tr_loss_sum - last_loss) < tolerance:
    stopping_counter += 1
    if stopping_counter >= patience:
        logger.debug(f"Performing early stopping at epoch {epoch}")
        torch.cuda.empty_cache()
        if return_best_model:
            return best_model
        else:
            return model
```
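In case it helps to make the question concrete, here is the kind of workaround I have been sketching: instead of deep-copying the compiled wrapper, snapshot only the underlying module's weights and restore them into a fresh (uncompiled) copy. This is only a sketch, under the assumption that `torch.compile` returns an `OptimizedModule` whose original module is reachable via the internal `_orig_mod` attribute; the tiny `nn.Linear` is just a stand-in for my real model, and the `getattr` fallback lets the snippet run even without a compiler backend:

```python
import copy

import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real pre-trained one.
net = nn.Linear(4, 2)

# In the real loop this would be: compiled = torch.compile(net).
# On torch >= 2.0 the wrapper keeps the original module at `_orig_mod`
# (an internal attribute); fall back to the module itself so this
# sketch also runs where torch.compile isn't usable.
compiled = net
orig = getattr(compiled, "_orig_mod", compiled)

# Snapshot only the weights, not the compiled wrapper or its cache.
best_state = copy.deepcopy(orig.state_dict())

# Later, restore the snapshot into a fresh, uncompiled copy.
best_model = copy.deepcopy(orig)
best_model.load_state_dict(best_state)
```

If something like this is sound, I could keep only `best_state` around during training and re-run `torch.compile` on the restored module at the end, rather than trying to copy the compiled object itself.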
The question I have is as follows: I currently do a `deepcopy` of the model. I know this works when the model is not compiled, so in that situation we do output the best model. But when the model is compiled, my understanding is that even `deepcopy` would fail, because the model object itself points at compiled code in a cache. I understand that compiled models can't be saved as compiled models (PT 2.0 - Are compiled models savable), and that, as of July 2023, compiled models couldn't be cloned/copied (Clone/Copy compiled model). Is there a way to make sure I output the compiled model with the lowest training loss? Would this entail making copies of the cache? Thank you so much for your time; again, I apologize if this question has been answered anywhere else, but I couldn't find anything on the forum or in the documentation.