RuntimeError: CUDA error: out of memory | return input.log_softmax(dim)

Hi,

I started training the following model using train_mtl.py:

After I entered training epoch 45 (out of 120), I received the following error:

Traceback (most recent call last):
  File "train_mtl.py", line 472, in <module>
    main()
  File "train_mtl.py", line 466, in main
    val_loss = test(epoch)
  File "train_mtl.py", line 367, in test
    loss2 = angle_loss(pred_vecmaps, util.to_variable(labels[0], True))
  File "C:\Users\...\road_connectivity-master\venv\lib\site-packages\torch\nn\modules\module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\...\road_connectivity-master\utils\loss.py", line 15, in forward
    log_p = F.log_softmax(inputs, dim=1)
  File "C:\Users\...\road_connectivity-master\venv\lib\site-packages\torch\nn\functional.py", line 975, in log_softmax
    return input.log_softmax(dim)
RuntimeError: CUDA error: out of memory

I am running PyTorch 0.4.1 on Windows 10 with an NVIDIA Quadro P3200 with Max-Q Design GPU.

Any ideas on how to solve this?

Hi @ntelo007,

I think your GPU (NVIDIA Quadro P3200) has 6 GB of memory, whereas I trained on a GPU with 12 GB. So I suggest lowering the batch size (probably halving it) to train the model.

Thanks

That means I should change the batch size to 8 (initially 16) and start training from the beginning, right? Is it possible to continue training from the best model so far, using a different batch size?

Hi @ntelo007,

Yes, it is possible to resume the training from the best saved model.
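As a rough sketch of what resuming with a smaller batch size can look like in PyTorch: the checkpoint file name, the dictionary keys (`epoch`, `state_dict`, `optimizer`) and the tiny stand-in model below are assumptions for illustration only, not necessarily what train_mtl.py uses. The batch size belongs to the data loader rather than to the saved weights, so it can change between runs.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and optimizer; the real ones are built in train_mtl.py.
model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Save a checkpoint (hypothetical file name and dictionary keys).
ckpt_path = os.path.join(tempfile.mkdtemp(), "model_best.pth.tar")
torch.save(
    {
        "epoch": 44,
        "state_dict": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    },
    ckpt_path,
)

# Resume: rebuild the same architecture, then restore weights and optimizer state.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = optim.SGD(resumed_model.parameters(), lr=0.01, momentum=0.9)
checkpoint = torch.load(ckpt_path, map_location="cpu")
resumed_model.load_state_dict(checkpoint["state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"] + 1

# The batch size is set on the data loader, so it can differ from the original run.
dataset = TensorDataset(torch.randn(32, 4), torch.randn(32, 2))
loader = DataLoader(dataset, batch_size=8, shuffle=True)
```

Training would then continue from `start_epoch` with the new loader. With a halved batch size the learning rate may need retuning, which this sketch does not attempt.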

I am confused now: if you were able to train and save the model, I wonder where the OOM error occurred. Can you post the full logs? Is it happening at testing time?

Thanks

Yes

Which files are you requesting exactly?

Now it makes sense. There was a small issue that was causing memory usage to increase. I have updated my code (train_mtl.py and util.py). Please take the latest version and run it again from scratch.
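The thread does not say what the fix was. One common cause of memory growth during validation in PyTorch 0.4 and later, not confirmed for this repository, is running the evaluation loop with autograd enabled: the old `volatile` flag has no effect from 0.4 onward, so every forward pass keeps its graph alive. A minimal sketch of an evaluation loop that avoids this, using a stand-in model and a hypothetical `validate` helper rather than the actual code from train_mtl.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the real network: 3 input channels, 5 output classes per pixel.
model = nn.Conv2d(3, 5, kernel_size=3, padding=1)


def validate(model, batches):
    """Return the average loss over batches without building autograd graphs."""
    model.eval()
    total_loss = 0.0
    with torch.no_grad():  # no graph is recorded, so activations are freed early
        for inputs, labels in batches:
            log_p = F.log_softmax(model(inputs), dim=1)
            loss = F.nll_loss(log_p, labels)
            # .item() yields a Python float, so no tensor is kept across iterations.
            total_loss += loss.item()
    return total_loss / len(batches)


# Random stand-in data: 3 batches of 2 images with per-pixel class labels.
batches = [
    (torch.randn(2, 3, 8, 8), torch.randint(0, 5, (2, 8, 8), dtype=torch.long))
    for _ in range(3)
]
val_loss = validate(model, batches)
```

Accumulating `loss.item()` instead of the loss tensor itself matters as well: summing tensors that still carry a graph keeps every batch's graph referenced until the end of the epoch.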

Thanks

I replaced the files, but now I am receiving many errors and cannot train your model. What exactly did you change? It seems that you even removed the main function.