RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 46.44 MiB free; 10.63 GiB reserved in total by PyTorch)

Hi everyone,
I've been stuck on this issue for a long time and have tried all the possible solutions I could find online.
I created a new AWS instance. While training a pretrained Pegasus PyTorch model on CUDA on it, I get this error within seconds of running the command:
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 46.44 MiB free; 10.63 GiB reserved in total by PyTorch)
The following are the things I tried, but none of them worked:

  1. torch.cuda.empty_cache()
  2. gc.collect() to remove unused variables (see the snippet below for roughly what I ran).
  3. Resetting the GPU with nvidia-smi --gpu-reset.
  4. Rebooting the instance.
  5. Reducing the batch size to 5 (news articles in this case).
  6. Interestingly, nvidia-smi shows no running processes on the GPU, so there is nothing to kill.

Please let me know if anyone has an idea how to tackle this issue.
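For reference, this is roughly the cleanup I ran between attempts (items 1 and 2 above); the print at the end is just my own sanity check:

```python
import gc
import torch

gc.collect()              # drop unreferenced Python objects
torch.cuda.empty_cache()  # return cached blocks to the CUDA driver

# sanity check: how much is still allocated by live tensors
print(torch.cuda.memory_allocated() / 1024**2, "MiB still allocated")
```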

As the error message indicates, you are running out of memory and would need to reduce the memory usage in your script, e.g. by reducing the batch size or by using torch.utils.checkpoint to trade compute for memory.
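A minimal sketch of the torch.utils.checkpoint idea; the Block/Model modules here are just placeholders to show the mechanism, not your actual Pegasus code:

```python
import torch
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    """Placeholder submodule standing in for e.g. one transformer layer."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())

    def forward(self, x):
        return self.net(x)


class Model(torch.nn.Module):
    def __init__(self, dim=512, num_blocks=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList([Block(dim) for _ in range(num_blocks)])

    def forward(self, x):
        out = x
        for block in self.blocks:
            # Activations inside `block` are not stored for backward; they are
            # recomputed during the backward pass, trading compute for memory.
            out = checkpoint(block, out)
        return out


model = Model().cuda()
x = torch.randn(2, 512, device="cuda", requires_grad=True)
model(x).mean().backward()
```

If you are using the Hugging Face implementation of Pegasus, check whether your transformers version also exposes gradient checkpointing directly via the model config, which wraps the same mechanism for you.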
The allocated memory that is not reported would be used by the CUDA context. If nvidia-smi doesn't report any processes, it might be a permission issue, but the GPU memory usage should nevertheless be shown.
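To see how much of the device memory PyTorch itself is actually holding (as opposed to the CUDA context or other processes), you can query the caching allocator; a small sketch:

```python
import torch

torch.ones(1, device="cuda")  # make sure a CUDA context exists

print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated by tensors")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")
print(torch.cuda.memory_summary())  # detailed breakdown of the allocator state
```

The gap between memory_reserved and the usage nvidia-smi reports for your process is roughly the CUDA context plus other overhead.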

One more thing you can try: del variable once the variable is no longer required. Examples: batch data at the end of each training iteration, models once they are not needed anymore, datasets which are no longer used, and so on.
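A minimal, self-contained sketch of that pattern (toy model and data, just to show where the del goes):

```python
import torch

# Toy stand-ins for the real model/optimizer/data; the point is the `del` at the end.
model = torch.nn.Linear(128, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [torch.randn(8, 128) for _ in range(10)]  # pretend DataLoader

for batch in loader:
    inputs = batch.cuda()
    loss = model(inputs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Drop the references so the allocator can reuse this memory in the next
    # iteration; otherwise `inputs`/`loss` keep the last batch alive.
    del inputs, loss
```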