I noticed that 99% of the GPU RAM is still in use, and no process is listed by nvidia-smi, after PyTorch aborts with a
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCStorage.cu:58
One could run sudo nvidia-smi --gpu-reset -i 0 as a remedy; however, it would be nice if PyTorch could free the memory after such a “crash”. I’m not sure if this is a problem only in my particular case, so I wanted to ask whether others have the same issue (before I open a GitHub issue).
EDIT 1: I should mention that the memory is freed if I, e.g., interrupt or kill the process; it’s just the RuntimeError: cuda runtime error (2) : out of memory that doesn’t seem to free the memory.
EDIT 2: It does not seem to be just a reporting issue in nvidia-smi, as I can’t run even smaller networks on that card after encountering the issue.
I honestly don’t think it’s because of the GPU. I’ve tried it on a GTX 970m and a Titan XP and it works fine.
The only other difference from your setup is that I installed using pip in a virtual environment.
Hm, that’s interesting! There are 8 Tesla K80s in that rack, and I just checked: the two on which I “crashed” the PyTorch code (via RuntimeError: cuda runtime error (2) : out of memory) still occupy that memory (GPUs 1 & 2; GPU 7 is the one I am currently using).
Yes, that’s correct. And it’s not only nvidia-smi showing this (e.g., because of some caching issue); the card really holds that memory — I can’t run any code on these cards anymore unless they are reset.
So, I remember this issue. When investigating, we found that there’s actually a bug in Python multiprocessing that can keep child processes hanging around as zombie processes.
They are not even visible to nvidia-smi.
The solution is killall python, or to find them with ps -elf | grep python and run kill -9 [pid] on each of them.
Solved my problem, thanks! Do you know if that’s a known bug listed on the Python bug tracker? Otherwise it might be worth reporting. Or, as a warning to users, it might be useful to extend the runtime error message to mention this, if possible?
Thank you so much for sharing the information. I actually came across this bug today when I terminated a PyTorch program that used a multi-threaded data loader. After killing the program, I tried to rerun it but kept seeing the CUDA OOM error. Running nvidia-smi showed everything was fine, and cudaMemGetInfo also did not expose the problem. For more details, please refer to this reply.
Now I think I understand why it “worked” that way.
I have the same problem.
First, I open a Python shell and type import torch.
Then I open another SSH session and run watch nvidia-smi.
Second, I return to the first Python shell and create a tensor of shape (27, 3, 480, 270), moving it to the GPU: input = torch.rand(27, 3, 480, 270).cuda()
The nvidia-smi page changes, and CUDA memory usage increases.
Third, I use Ctrl+Z to quit the Python shell.
The CUDA memory is not freed automatically; the nvidia-smi page indicates the memory is still in use.
The workaround is to run kill -9 <pid> to kill the process and free the CUDA memory by hand.
I use Ubuntu 16.04, Python 3.5, PyTorch 1.0.
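The steps above behave as they do because Ctrl+Z sends SIGTSTP, which only suspends the process rather than terminating it, so its CUDA context (and the GPU memory behind it) stays alive. A minimal sketch, using `sleep` as a stand-in for the Python shell and SIGSTOP (which has the same stopping effect as SIGTSTP but cannot be caught or ignored):

```python
# Sketch: Ctrl+Z suspends a process; it does not exit, so nothing it
# holds (e.g. a CUDA context) is released. `sleep` stands in for the
# suspended Python shell.
import signal
import subprocess
import time

proc = subprocess.Popen(["sleep", "60"])   # stand-in for the Python shell
proc.send_signal(signal.SIGSTOP)           # what Ctrl+Z effectively does
time.sleep(0.2)

# On Linux, /proc/<pid>/stat reports state 'T' (stopped): the process
# still exists, together with everything it has allocated.
with open(f"/proc/{proc.pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)                               # 'T'

proc.kill()                                # kill -9: this actually frees it
proc.wait()
```

That is why `kill -9 <pid>` works here: only actual termination of the process releases its GPU memory.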
Although the problem is solved, it’s uncomfortable that the CUDA memory cannot be freed automatically.