CUDA out of memory in the middle of training

Stanley_Sie · January 23, 2023, 12:53am

I had trained up to epoch 8 the night before without any problem, then turned off my laptop for the night. When I came back in the morning to continue the training, I got the following error.

CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 6.00 GiB total capacity; 3.24 GiB already allocated; 0 bytes free; 5.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I was using a batch size of 20, and when I reduce the batch size to 16, it runs properly. But it seems strange to me that last night it ran without any issue with the exact same configuration. I’m also a bit confused by the allocated and reserved numbers: if the training really consumes all the memory on my GPU, shouldn’t allocated be at least close to the total reserved? I’m also pretty sure no other program or application is running on my GPU. Could you help explain what may cause this?
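To make the allocated/reserved question concrete, here is a minimal sketch that pulls the numbers out of the error message. The helper name `parse_oom_message` is hypothetical, and the regular expressions assume the message format shown in this thread. Reserved memory is what PyTorch's caching allocator is holding; allocated memory is the part currently backing live tensors, so the difference is memory that is cached but not in use.

```python
import re

# Units as they appear in the OOM message, converted to MiB.
_UNIT_TO_MIB = {"bytes": 1 / (1024 * 1024), "KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
_SIZE = r"([\d.]+) (bytes|KiB|MiB|GiB)"


def parse_oom_message(message):
    """Extract sizes (in MiB) from an OOM message of the format quoted above."""
    patterns = {
        "requested": r"Tried to allocate " + _SIZE,
        "total": _SIZE + r" total capacity",
        "allocated": _SIZE + r" already allocated",
        "free": _SIZE + r" free",
        "reserved": _SIZE + r" reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            sizes[name] = float(value) * _UNIT_TO_MIB[unit]
    return sizes


msg = (
    "CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 6.00 GiB total capacity; "
    "3.24 GiB already allocated; 0 bytes free; 5.30 GiB reserved in total by PyTorch)"
)
sizes = parse_oom_message(msg)

# Memory held by the caching allocator but not backing any live tensor.
cached_unused_mib = sizes["reserved"] - sizes["allocated"]
print(f"requested: {sizes['requested']:.0f} MiB, cached but unused: {cached_unused_mib:.0f} MiB")
```

For this message the gap is 5.30 GiB - 3.24 GiB = 2.06 GiB, far more than the 128 MiB request. That is the "reserved >> allocated" pattern the error message itself associates with fragmentation: the cached memory exists, but possibly not as one contiguous block of the requested size.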

If at all possible, I don’t want to change the batch size and start training again from scratch.
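One option that leaves the batch size untouched is the `max_split_size_mb` setting that the error message points to. A minimal sketch, assuming the value 128 purely for illustration (it is not a recommendation from this thread, and there is no guarantee it makes batch size 20 fit):

```python
import os

# Assumption: 128 is only an illustrative value; tune it for your workload.
# The setting has to be in place before the CUDA caching allocator is
# initialized, so the conservative choice is to set it at the very top of the
# training script, before torch is imported.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

# import torch  # import torch only after the variable is set
```

The same setting can be exported in the shell before launching the script; `setdefault` is used here so a value already provided by the shell is not overwritten.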

ptrblck · January 23, 2023, 4:24am

Yes, allocated memory should be close to reserved memory, assuming your script is using almost all memory while the cache is nearly empty, which doesn’t seem to be the case here.
It’s strange that the same code is suddenly not running anymore. Are you able to reproduce this OOM error consistently now at the same epoch?
Also, did something change, e.g. the input shape, the dataset, etc.?
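One way to check whether the input shape changes between runs or between batches is to record every shape the training loop sees. The `ShapeTracker` class below is a hypothetical helper, not part of PyTorch; it only assumes that each batch exposes a `.shape` attribute, as a `torch.Tensor` or a NumPy array does.

```python
from collections import Counter


class ShapeTracker:
    """Counts the distinct input shapes seen during training (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, batch):
        # Works for any object exposing `.shape`, e.g. a torch.Tensor.
        shape = tuple(batch.shape)
        self.counts[shape] += 1
        return shape

    def largest(self):
        # Shape with the most elements: a rough proxy for the most
        # memory-hungry batch seen so far.
        def numel(shape):
            n = 1
            for dim in shape:
                n *= dim
            return n

        return max(self.counts, key=numel)
```

Calling `tracker.record(inputs)` once per iteration and printing `tracker.counts` at the end of an epoch shows whether all batches really share one shape. If more than one shape shows up, `tracker.largest()` is the one to test against the memory limit.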

Stanley_Sie · January 23, 2023, 6:18am

I tried running it again from scratch with the same configuration and got stuck on the same error. Before it happened, I had actually been able to train until epoch 9. However, when I resumed the training, loading the saved models gave me errors suggesting that the saved files might be corrupted, so I started again from the model saved at epoch 8. So, to be able to continue the training, I have to either make the model a bit smaller, reduce the batch size, or reduce the dataset.
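The thread does not establish why the epoch 9 files were unreadable. If the cause was a write interrupted partway (an assumption), a common safeguard is to write each checkpoint to a temporary file and rename it into place only once the write has finished. A minimal sketch, where `atomic_save` and its `save_fn` parameter are hypothetical names:

```python
import os
import tempfile


def atomic_save(obj, path, save_fn):
    """Write a checkpoint to a temp file, then rename it over `path`.

    `save_fn(obj, file_path)` does the actual serialization.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)  # save_fn reopens the file by path
    try:
        save_fn(obj, tmp_path)
        os.replace(tmp_path, path)  # replaces the old checkpoint in a single step
    except BaseException:
        # Leave any previous checkpoint untouched and clean up the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With PyTorch this could be called as `atomic_save(model.state_dict(), "checkpoint.pt", torch.save)`, where `model` and the file name are placeholders. If the process dies mid-write, the previous checkpoint is still intact, so training can resume from it.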

There were no changes made in any way. I encountered the same issue a few months ago and decided to just rerun it from scratch, and today I encountered it again. I’m not sure whether this could be the cause, but I’m currently training on WSL2. My laptop runs Windows 11, but in my experience it’s a bit stressful to replicate models in a Windows environment. That’s why I’ve been using WSL2, and it has been working well for me. It’s probably not the issue here, but I mention it in case it’s a useful detail.

ammar_naich · September 23, 2023, 7:58am

I have also encountered the same issue before. In my case, I was using a dynamic operation that kept changing the feature map (input dimensions) based on the initial input feature. After imposing a fixed structure on the feature map, it worked well.
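A quick calculation shows why a feature map whose size varies can push a run over the limit even when the batch size stays the same. The shapes below are hypothetical, and the function counts a single dense float32 activation only; real training keeps many activations plus gradients and optimizer state.

```python
def feature_map_mib(batch, channels, height, width, bytes_per_element=4):
    """Size in MiB of one dense feature map (float32 by default)."""
    return batch * channels * height * width * bytes_per_element / (1024 ** 2)


# Hypothetical shapes, chosen only to illustrate the scaling.
fixed_20 = feature_map_mib(20, 64, 256, 256)   # batch size 20, fixed 256x256 map
fixed_16 = feature_map_mib(16, 64, 256, 256)   # batch size 16, same map
grown_20 = feature_map_mib(20, 64, 320, 320)   # batch size 20, map grown to 320x320

print(fixed_20, fixed_16, grown_20)
```

Memory grows linearly with batch size but with the product of height and width, so a modest increase in spatial size costs more than the four samples saved by going from batch size 20 to 16. Fixing the feature map size makes the peak predictable.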