Hi everyone,
I've been stuck on this issue for a long time and have tried every solution I could find online.
I created a new AWS instance. While training a pretrained Pegasus PyTorch model on CUDA on it, I get this error within seconds of running the command:
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 46.44 MiB free; 10.63 GiB reserved in total by PyTorch)
Here is what I tried, but none of it worked:
- torch.cuda.empty_cache()
- gc.collect() to remove unused variables.
- Resetting the GPU with nvidia-smi --gpu-reset.
- Rebooting the instance.
- Reducing the batch size to 5 (news articles, in this case).
- Interestingly, nvidia-smi shows no running processes on the GPU, so there is nothing to kill.
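For reference, this is roughly how I combined the first two attempts (a minimal sketch; `free_gpu_memory` is just a name I made up for the cleanup steps, and the torch import is guarded so it also runs on a CPU-only machine):

```python
import gc

def free_gpu_memory():
    # Drop unreferenced Python objects first, so any tensors they
    # held become eligible for release.
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            # Return cached blocks from PyTorch's allocator to the driver.
            # Note: this does NOT free memory still held by live tensors.
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing GPU-side to clean up

free_gpu_memory()
```

Even after calling this, the allocation error above still occurs.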
Please let me know if anyone has an idea how to tackle this issue.