I have long faced this problem, but never investigated it until now.
It seems that processes with sequential PIDs are spawned and then persist. They don’t show up in nvidia-smi or in top, but if I run fuser /dev/nvidia*, processes with sequential PIDs show up.
The number of processes is equal to num_workers in the PyTorch data loader.
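For anyone who wants to script the fuser check, here is a rough Python sketch that walks /proc and lists the PIDs with an open file descriptor on a /dev/nvidia* device. The function name pids_holding and its parameters are my own (hypothetical), and unlike fuser it only looks at open file descriptors, not memory mappings, so it may miss some processes. You generally need root to see other users' processes.

```python
import os


def pids_holding(prefix="/dev/nvidia", proc_root="/proc"):
    """Return sorted PIDs that have an open fd whose target starts with `prefix`.

    Hypothetical helper: approximates `fuser /dev/nvidia*` by reading the
    symlinks in /proc/<pid>/fd. Processes we cannot inspect are skipped.
    """
    found = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # only numeric directories are processes
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # process exited, or permission denied
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(prefix):
                found.add(int(entry))
                break  # one matching fd is enough for this PID
    return sorted(found)
```

Calling pids_holding() after a training run has exited should list the leftover workers if they are still holding the device; a run of consecutive PIDs matching num_workers would fit what I describe above.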
It becomes a big problem because every once in a while these processes end up stuck in a sleep state, either ‘interruptible sleep’ (S) or ‘uninterruptible sleep’ (D), at which point we have to reboot our system, which is a BIG problem for research clusters.
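kill -9 doesn’t kill them once they’ve gone into uninterruptible sleep (D): the signal is only acted on when the process leaves that state. Zombie (Z) processes are already dead and only go away when their parent reaps them. Processes in ordinary interruptible sleep (S) can still be killed with kill -9 (look up Linux process states).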
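To check which state a leftover worker is actually in before reaching for kill -9, the state letter can be read from /proc/<pid>/stat. This is a minimal sketch; parse_state, process_state and STATE_NAMES are names I made up for illustration, and the table only covers the common state letters documented for ps.

```python
import os

# Common state letters as documented in the ps man page.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}


def parse_state(stat_text):
    """Extract the state letter from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')' characters, so split on the LAST ')' in the line.
    """
    _, _, rest = stat_text.rpartition(")")
    return rest.split()[0]


def process_state(pid, proc_root="/proc"):
    """Return the state letter for `pid` (raises OSError if it is gone)."""
    with open(os.path.join(proc_root, str(pid), "stat")) as f:
        return parse_state(f.read())
```

For example, STATE_NAMES.get(process_state(pid), "other") gives a readable label for each PID that fuser reports. A worker that shows D is waiting inside the kernel, and that is the case where kill -9 appears to do nothing.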