Performance changes when the model is loaded on another machine with a different GPU

Hello everyone,

I have a question about loading a saved model.
I trained a model on a server with 8 GPUs using nn.DataParallel and got 78.07% accuracy on the validation set. But when I reload the saved model on another server with 2 GPUs (still using DataParallel) and run the validation script (with the same dataset, of course), the accuracy drops to 76.86%. I am wondering why.

Note: When I go back to the server where the model was trained, everything works normally; running the validation script there gives exactly the same performance.

I would like to know if I did something wrong either in saving the model or in reloading the saved model.
Any advice that could help would be welcome.

To save the model I used:

torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, './save_model_folder/model.tar')

To load the saved model I used:

checkpoint = torch.load('./save_model_folder/model.tar')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
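
For reference, here is a minimal sketch of a loading pattern that should not depend on the GPU layout of the training machine. It assumes model is the bare, un-wrapped network and uses the same file path as above, so treat the details as an assumption rather than a drop-in fix:

import torch
import torch.nn as nn

# Load onto the CPU so the checkpoint does not depend on which GPUs
# (or how many) the training server had.
checkpoint = torch.load('./save_model_folder/model.tar', map_location='cpu')

# nn.DataParallel prefixes parameter names with 'module.'; strip it if present
# so the weights load into the bare model.
state_dict = {
    (k[len('module.'):] if k.startswith('module.') else k): v
    for k, v in checkpoint['model_state_dict'].items()
}

model.load_state_dict(state_dict)        # model is the bare network here
model = nn.DataParallel(model).cuda()    # re-wrap for the GPUs on this machine
model.eval()                             # put dropout/batchnorm into eval mode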

Thank you in advance.

Are you adjusting the global batch size to account for the smaller number of GPUs? The numerics of your program will change if you change the number of GPUs and nothing else.
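
To make that concrete: nn.DataParallel splits each DataLoader batch across the visible GPUs, so keeping the per-GPU batch size constant means scaling the loader's batch_size with the GPU count. A rough sketch (the per-GPU batch size and the dummy dataset below are placeholders, not your actual setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real validation set.
val_dataset = TensorDataset(torch.randn(100, 3, 224, 224),
                            torch.zeros(100, dtype=torch.long))

per_gpu_batch = 32                               # placeholder value
num_gpus = max(torch.cuda.device_count(), 1)     # e.g. 8 on one server, 2 on the other
global_batch = per_gpu_batch * num_gpus          # DataParallel splits this across GPUs

val_loader = DataLoader(val_dataset, batch_size=global_batch, shuffle=False)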

Yes, I did. I think it is something related to that server, because when I evaluated the model on a third server (a different one from the other two), I got the correct results.
