PyTorch doesn't free GPU memory if it gets aborted due to an out-of-memory error

@rabst
So, I remember this issue. When we investigated, we found that there's actually a bug in Python multiprocessing that can leave child processes hanging around as zombie processes.
They are not even visible to nvidia-smi.

The solution is to run killall python, or to run ps -elf | grep python, find the leftover processes, and kill -9 [pid] each of them.
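If you'd rather script the ps / kill -9 steps than do them by hand, here is a rough sketch in Python. The function names (find_python_pids, kill_leftover_python) are made up for illustration and are not part of PyTorch or any library. It assumes a Unix-like system with ps available, and that every process whose executable name starts with "python" is fair game, which is as blunt as killall python. It defaults to a dry run so you can check the list first.

```python
import os
import signal
import subprocess


def find_python_pids(ps_output, exclude=()):
    """Return PIDs from 'pid args' lines whose executable name starts with 'python'.

    Hypothetical helper: matching is done on the executable name only,
    so a line like 'grep python' is not picked up.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid = int(parts[0])
        exe = os.path.basename(parts[1].split()[0])
        if exe.startswith("python") and pid not in exclude:
            pids.append(pid)
    return pids


def kill_leftover_python(dry_run=True):
    """List (and optionally SIGKILL) python processes other than this one and its parent."""
    # Two separate -o options with empty headers suppress the header row.
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = find_python_pids(out, exclude={os.getpid(), os.getppid()})
    for pid in pids:
        if dry_run:
            print(f"would kill {pid}")
        else:
            os.kill(pid, signal.SIGKILL)  # same effect as kill -9 <pid>
    return pids
```

Call kill_leftover_python() first to see what would be killed, then kill_leftover_python(dry_run=False) once the list looks right. Keep in mind this also takes out any unrelated Python jobs running under your user.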
