@rabst
so, I remember this issue. When we investigated, we found that there's actually a bug in Python multiprocessing that can leave the child processes hanging around as zombie processes.
They are not even visible to `nvidia-smi`.
The solution is `killall python`, or to run `ps -elf | grep python`, find the leftover processes, and `kill -9 [pid]` each of them.
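If you'd rather do the `ps` / `kill -9` route from a script, here's a rough sketch of the same idea. It assumes a Unix-like system with a POSIX `ps`, and the helper names (`parse_ps`, `find_leftover_python`, `kill_pids`) are made up for illustration, not part of any library. It only lists candidates until you explicitly pass PIDs to `kill_pids`, since matching on "python" will also catch processes you want to keep.

```python
import os
import signal
import subprocess


def parse_ps(output, needle="python"):
    """Return (pid, command) pairs from `ps -e -o pid= -o args=` output
    whose command line contains `needle`."""
    matches = []
    for line in output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, command = int(parts[0]), parts[1]
        if needle in command:
            matches.append((pid, command))
    return matches


def find_leftover_python(needle="python"):
    """List running processes that mention `needle`, excluding this script."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    me = os.getpid()
    return [(pid, cmd) for pid, cmd in parse_ps(result.stdout, needle) if pid != me]


def kill_pids(pids, sig=signal.SIGKILL):
    """Send `sig` (default SIGKILL, i.e. kill -9) to each PID; return those signalled."""
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # process already exited
        except PermissionError:
            print(f"no permission to signal {pid}")
    return signalled
```

Call `find_leftover_python()` first, check the printed command lines to pick out the stuck workers, and then pass just those PIDs to `kill_pids([...])`.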