Hi everyone,
I've been stuck on this issue for a long time and have tried every solution I could find online.
I created a new AWS instance. While training a pretrained Pegasus PyTorch model on CUDA on it, I get this error within seconds of running the command:
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 46.44 MiB free; 10.63 GiB reserved in total by PyTorch)
Here is what I tried, but none of it worked:
- torch.cuda.empty_cache()
- gc.collect() to remove unused variables.
- Resetting the GPU with nvidia-smi --gpu-reset.
- Rebooting the instance.
- Reducing the batch size to 5 (news articles, in this case).
- Interestingly, nvidia-smi shows no running processes on the GPU, so there is nothing to kill.
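For reference, this is roughly how I combined the first two attempts (a minimal sketch; `free_gpu_memory` is just a name I made up for the cleanup steps, and the torch import is guarded so it also runs on a CPU-only machine):

```python
import gc

def free_gpu_memory():
    # Drop unreferenced Python objects first, so any tensors they
    # held become eligible for release.
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            # Return cached blocks from PyTorch's allocator to the driver.
            # Note: this does NOT free memory still held by live tensors.
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing GPU-side to clean up

free_gpu_memory()
```

Even after calling this, the allocation error above still occurs.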
Please let me know if anyone has an idea how to tackle this issue.