"Out of Memory" error when restarting the training

Pulkit_Kumar · October 30, 2017, 9:37am

Greetings,

I have been facing a weird error. I am training a model with a fixed batch size on a 1080 Ti GPU using PyTorch 0.2.0_1, Python 2.7, and Ubuntu 14.04. I trained the model for a certain number of epochs, saving the model weights and the optimizer state along the way. When I tried to restart training with the same batch size, reloading the weights and the optimizer, the GPU went “out of memory”.

Can anyone please help me understand why this is happening?

SpandanMadan · October 30, 2017, 9:45am

Are you copying your model somewhere in the script? It’s possible you may be creating multiple copies of something which may be messing up stuff.

Also, code and error message are needed to help out really. There is no context without it…

Pulkit_Kumar · October 31, 2017, 11:47am

Sorry about the formatting. The basic code is:

model_state = torch.load("latest_model.pth")
model.load_state_dict(model_state['state_dict'])
optimizer_state = torch.load("latest_optim.pth")
optimizer.load_state_dict(optimizer_state['state_dict'])

for epoch in xrange(no_of_epochs):
    model.train()
    # basic training code
    model.eval()
    # basic validation code

The only difference is that I have made functions for training and testing that take the model and optimizer as parameters. I don't think that counts as “copying the model”.

Alexpon · January 4, 2018, 3:08am

I ran into the same problem.
Calling model.cuda() “after” loading the state_dict solved it for me.
For your reference.
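
A possible explanation, offered as a guess rather than a confirmed diagnosis: by default torch.load restores tensors to the device they were saved from, so a checkpoint saved from the GPU is loaded back onto the GPU alongside the model already sitting there. The sketch below shows one way to resume while keeping the checkpoint in host memory until the model has been restored. It is written against a recent PyTorch release (torch.device and .to() do not exist in 0.2.0), and the nn.Linear model, the SGD settings, the file name latest_checkpoint.pth, and the dictionary keys are all stand-ins chosen for illustration, not taken from the original script.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Fall back to the CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- First run: train briefly and save a checkpoint (hypothetical tiny model) ---
model = nn.Linear(4, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

loss = model(torch.randn(8, 4, device=device)).sum()
loss.backward()
optimizer.step()  # creates the momentum buffers held in the optimizer state

path = os.path.join(tempfile.mkdtemp(), "latest_checkpoint.pth")
torch.save(
    {"state_dict": model.state_dict(), "optimizer": optimizer.state_dict()},
    path,
)

# --- Restart: load the checkpoint onto the CPU first ---
# map_location="cpu" keeps the loaded tensors in host memory instead of
# placing them back on the GPU they were saved from.
checkpoint = torch.load(path, map_location="cpu")

resumed_model = nn.Linear(4, 2)
resumed_model.load_state_dict(checkpoint["state_dict"])
resumed_model.to(device)  # move to the GPU only after the weights are loaded

# Build the optimizer on the already-moved parameters, then restore its state.
resumed_optimizer = torch.optim.SGD(
    resumed_model.parameters(), lr=0.1, momentum=0.9
)
resumed_optimizer.load_state_dict(checkpoint["optimizer"])

# Drop the reference to the CPU copy once both objects are restored.
del checkpoint
```

This follows the same order as the fix reported above (load the state_dict, then move the model to the GPU). Whether it removes the out-of-memory error in a given script depends on what else holds GPU memory at that point.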