So I know my GPU is close to being out of memory with this training, which is why I only use a batch size of two, and that seems to work alright.
The problem arises when I first load the existing model using torch.load and then resume training. As soon as training resumes, it instantly says:
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 1.96 GiB total capacity; 1.36 GiB already allocated; 46.75 MiB free; 38.25 MiB cached)
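To give a bit of context, my resume step is roughly the following (simplified sketch; the file name, checkpoint keys, and `Net` class are just placeholders for my actual code):

```python
import torch

device = torch.device("cuda:0")

# Build the model and optimizer the same way as in the original training run
model = Net().to(device)                      # Net is a placeholder for my real model class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Load the checkpoint from the previous run; this is where the OOM seems to appear.
# With the default map_location, the saved tensors are restored onto the GPU.
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])

# ... then the usual training loop with batch size 2
```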
I don’t know how to get rid of this error. When running nvidia-smi right after the error, I obtain this:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 991 G /usr/lib/xorg/Xorg 120MiB |
| 0 1350 G cinnamon 29MiB |
| 0 2878 G ...incent/anaconda3/envs/newtor/bin/python 2MiB |
+-----------------------------------------------------------------------------+
So it’s definitely something about loading the model that makes it break. Can you help me out? Maybe there is a way to hack it and reset the CUDA memory usage after loading the model? Thanks in advance.
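For what it’s worth, the workaround I was thinking of trying looks something like this: load the checkpoint onto the CPU first so the saved tensors don’t land on the GPU next to the live model, then free the checkpoint dict and the cached memory before resuming (again, names are placeholders, and I’m not sure this is the right approach):

```python
import torch

device = torch.device("cuda:0")

model = Net().to(device)                      # Net is again a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Load the checkpoint onto the CPU so the saved GPU tensors are not restored
# onto the GPU a second time alongside the freshly built model
checkpoint = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])

# Drop the CPU copy and release any cached GPU memory before training resumes
del checkpoint
torch.cuda.empty_cache()
```

Would that be a sensible way to handle it, or is there something else going on here?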