I’m running a kernel that uses PyTorch and a CNN for MNIST digit classification. The model is fairly simple, but I’m still getting a CUDA out of memory error.
Here is the link to the kernel : https://www.kaggle.com/whizzkid/training-best-cnn-model-pytorch
I think I’m using PyTorch in some wrong way, because I’ve seen more complex models trained successfully on Kaggle. I tried the same model with Keras and it worked as well, even though the batch size is only 16.
Kaggle provides a 16 GB GPU. That’s plenty to train my model, but I’m still getting the error and I don’t know why.
Can anyone tell me what I am doing wrong?
The link is broken.
Are you getting out of memory right at the start of training, or does it train for some time before you hit the OOM?
Also, can you check with batch_size=1 and see if the model runs ?
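To tell a gradual leak apart from a model that is simply too big, you can log allocated GPU memory once per epoch. A minimal sketch (the helper name `log_gpu_memory` is made up for illustration):

```python
import torch

def log_gpu_memory(tag=""):
    # If the allocated amount grows epoch after epoch, something in the
    # training loop is holding on to tensors (e.g. accumulated losses).
    if torch.cuda.is_available():
        mb = torch.cuda.memory_allocated() / 1024**2
        print(f"{tag}: {mb:.1f} MiB allocated")
    else:
        print(f"{tag}: CUDA not available")

log_gpu_memory("after epoch 1")
```

Call it at the end of each epoch; a steady upward trend points to a leak rather than an undersized GPU.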
Sorry about the link, it’s working now.
I’m getting out of memory after 5-6 epochs, or sometimes 9-10 epochs.
Make the following change in your code, in both places:
net_loss += loss.item()
loss.item() just accumulates the loss value, whereas if you add loss itself you keep accumulating the whole computation graphs.
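As a minimal sketch of the fix (a made-up tiny model and random data, not the kernel’s actual code), the corrected accumulation looks like this:

```python
import torch
import torch.nn as nn

# Toy model and optimizer, purely for illustration.
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

net_loss = 0.0
for _ in range(3):  # stands in for batches in an epoch
    inputs = torch.randn(16, 4)
    targets = torch.randint(0, 2, (16,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # .item() extracts a plain Python float, so the graph attached to
    # `loss` can be freed after this iteration; `net_loss += loss`
    # would instead keep every iteration's graph alive.
    net_loss += loss.item()

print(type(net_loss))  # a plain float, no tensor or graph attached
```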
Thanks @god_sp33d, it’s working now.
Although I understand the problem a bit now, could you please explain in detail what was happening?
Earlier, when I printed the loss value, it behaved the same way as it does now, yet now I’m no longer getting the out of memory error.
When you call .item(), it returns the value of the tensor, which is just a Python number. In your earlier code you were adding loss itself, which is a reference to the tensor (not its value). So every time you did total_loss += loss, you kept holding a reference to that computation graph, which would otherwise have been destroyed after the update. To see the difference yourself, compare print(loss) with print(loss.item()).
The first prints a tensor, which is still attached to the graph (as everything is connected), whereas the second is just a float.
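A quick self-contained way to see this (a toy tensor, not the kernel’s loss):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (x ** 2).sum()  # 1 + 4 = 5

print(loss)         # tensor(5., grad_fn=<SumBackward0>) -- still wired into the graph
print(loss.item())  # 5.0 -- a plain Python float, detached from autograd
```

The grad_fn in the first print is exactly the graph reference that was being retained by total_loss += loss.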
PS: If your problem is solved, mark the above comment as the solution so that others may find it helpful.