I just hit a memory leak where 90% of the GPU memory stayed occupied AFTER I closed Python and ran killall python + killall jupyter for good measure. nvidia-smi still showed the 90% memory usage but listed no corresponding processes actually using it…
How are things like this possible? How do I prevent them? And how do I diagnose them?
The only recourse I had that I knew of at the time was to reboot. But I really need some context to this problem…
P.S. I just got this to happen again by interrupting a jupyter notebook cell during the training process. So that may be related. But still I’m not sure how to deal with these issues in the long run.
Interrupting a process is often the root cause of this kind of failed cleanup on exit. In my experience I haven't seen orphaned processes often when working directly in the terminal (unless the code crashes), and I've seen it more often with Jupyter (but I'm also not a heavy Jupyter user, so others might have a different experience).
Thanks for the reply! Interesting choice of IDE; I've been leaning that way too.
But regarding killing the process: I already tried killall python and killall jupyter. Do you think it is some other process, then?
Try passing the -9 flag to killall, so that it sends the stronger SIGKILL which cannot be blocked. E.g. killall -9 jupyter.
Killing all Python processes may not be advisable, since you may end up disrupting non-ML programs. A better idea is to find the PIDs of your ML Python programs using ps -u $USER and to kill only the right ones with kill -9 <PID of the python process that you want to kill>.
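A rough sketch of that workflow on Linux (assuming your jobs run under your own user; the PID below is just a placeholder):

# list your own processes and spot the Python/Jupyter ones
ps -u $USER -o pid,etime,cmd | grep -i python

# kill only the offending one (12345 is a placeholder PID)
kill -9 12345

# confirm the GPU memory was actually released
nvidia-smi

If nvidia-smi keeps showing memory in use but no owning process, the memory is often held by an orphaned or zombie process that still has the NVIDIA device files open. fuser (from the psmisc package) can sometimes reveal it:

# show every process holding a handle to the NVIDIA device files
sudo fuser -v /dev/nvidia*

Killing whatever that lists should release the memory; rebooting is only needed as a last resort.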