@rabst
I remember this issue. When we investigated, we found that there is actually a bug in Python multiprocessing that can leave child processes hanging around as zombie processes.
They are not even visible to nvidia-smi.
The solution is to run killall python, or to list the leftover processes with ps -elf | grep python and then run kill -9 [pid] on each of them.
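
If you would rather script the manual ps / kill -9 steps, here is a rough sketch of the same idea in Python. The function names (find_python_pids, kill_leftover_python) and the dry_run flag are made up for this example, and it assumes a Unix-like system where ps -eo pid,args is available. It is only an illustration, not something we ship.

```python
import os
import signal
import subprocess


def find_python_pids(ps_output, exclude=()):
    """Return PIDs from `ps -eo pid,args` output whose command line mentions python."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header row and anything that does not start with a numeric PID.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1]
        if "python" in args and pid not in exclude:
            pids.append(pid)
    return pids


def kill_leftover_python(dry_run=True):
    """List python processes and, unless dry_run is set, send them SIGKILL (kill -9)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    # Never target this script itself or the process that launched it.
    pids = find_python_pids(out, exclude={os.getpid(), os.getppid()})
    for pid in pids:
        if dry_run:
            print("would kill", pid)
        else:
            os.kill(pid, signal.SIGKILL)
    return pids


if __name__ == "__main__":
    kill_leftover_python(dry_run=True)
```

Like killall python, this matches every process with python in its command line, so run it with dry_run=True first and check the list before switching it off.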