I’m running a kernel that uses PyTorch and a CNN for MNIST digit classification. The model is fairly simple, but I’m still getting a CUDA out of memory error.
Here is the link to the kernel : https://www.kaggle.com/whizzkid/training-best-cnn-model-pytorch
I think I’m using PyTorch in some wrong way, because I’ve seen more complex models trained successfully on Kaggle. I tried the same model with Keras and it worked as well, even though the batch size is only 16.
Kaggle provides a 16 GB GPU. That’s plenty to train my model, but I’m still getting the error and I don’t know why.
Can anyone tell me what I am doing wrong?
The link is broken.
Are you getting out of memory right at the start of training, or does it train for some time before you hit the OOM?
Also, can you check with batch_size=1 and see if the model runs ?
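To tell a gradual leak apart from a model that is simply too big, you can log allocated GPU memory once per epoch. A minimal sketch (the helper name `log_gpu_memory` is made up for illustration):

```python
import torch

def log_gpu_memory(tag=""):
    # If the allocated amount grows epoch after epoch, something in the
    # training loop is holding on to tensors (e.g. accumulated losses).
    if torch.cuda.is_available():
        mb = torch.cuda.memory_allocated() / 1024**2
        print(f"{tag}: {mb:.1f} MiB allocated")
    else:
        print(f"{tag}: CUDA not available")

log_gpu_memory("after epoch 1")
```

Call it at the end of each epoch; a steady upward trend points to a leak rather than an undersized GPU.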
Sorry about the link, it’s working now.
I’m getting out of memory after 5-6 epochs, or sometimes 9-10 epochs.
Make the following change in your code, in both places:
net_loss += loss.item()
loss.item() just accumulates the loss value, whereas if you add loss itself you keep accumulating the whole computation graphs.
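As a minimal sketch of the fix (a made-up tiny model and random data, not the kernel’s actual code), the corrected accumulation looks like this:

```python
import torch
import torch.nn as nn

# Toy model and optimizer, purely for illustration.
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

net_loss = 0.0
for _ in range(3):  # stands in for batches in an epoch
    inputs = torch.randn(16, 4)
    targets = torch.randint(0, 2, (16,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # .item() extracts a plain Python float, so the graph attached to
    # `loss` can be freed after this iteration; `net_loss += loss`
    # would instead keep every iteration's graph alive.
    net_loss += loss.item()

print(type(net_loss))  # a plain float, no tensor or graph attached
```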
Thanks @god_sp33d, it’s working now.
Although I understand the problem a bit now, could you please explain in detail what was happening?
Earlier, when I printed the loss value, it behaved the same way as it does now, yet now I’m no longer getting the out of memory error.
When you call .item(), it returns the value of the tensor, which is just a Python number. In your earlier code you were adding loss itself, which is a reference to the tensor (not its value). So every time you did total_loss += loss, you kept holding a reference to that computation graph, which would otherwise have been destroyed after the update. To see the difference yourself, compare print(loss) with print(loss.item()).
The first prints a tensor, which is still attached to the graph (as everything is connected), whereas the second is just a float.
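A quick self-contained way to see this (a toy tensor, not the kernel’s loss):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (x ** 2).sum()  # 1 + 4 = 5

print(loss)         # tensor(5., grad_fn=<SumBackward0>) -- still wired into the graph
print(loss.item())  # 5.0 -- a plain Python float, detached from autograd
```

The grad_fn in the first print is exactly the graph reference that was being retained by total_loss += loss.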
PS: If your problem is solved, mark the above comment as the solution so that others may find it helpful.