I have long faced this problem, but never investigated it until now.
It seems that processes with sequential IDs are spawned and then persist. They show up neither in nvidia-smi nor in top, but if I run fuser /dev/nvidia*, processes with sequential IDs show up.
The number of processes is equal to num_workers in the PyTorch data loader.
This becomes a big problem because every once in a while these processes end up stuck in a sleep state, either S (interruptible sleep) or D (uninterruptible sleep), at which point we have to reboot the system, which is a BIG problem for research clusters.
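
For anyone who wants to check the same thing on their own machine, here is a rough sketch of how the leftover processes and their states could be listed without fuser. It walks /proc, looks for open file descriptors that point at /dev/nvidia*, and reads the state letter from /proc/<pid>/stat. The helper names parse_stat and nvidia_holders are made up for this example. It only sees open file descriptors (fuser also reports memory-mapped files), and it usually needs root to see other users' processes, so treat it as an approximation rather than a replacement for fuser.

```python
import glob
import os


def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    left, right = stat_text.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]  # e.g. R, S (interruptible), D (uninterruptible)
    return comm, state


def nvidia_holders(proc_root="/proc", prefix="/dev/nvidia"):
    """Map pid -> (comm, state) for processes with an open fd under prefix."""
    holders = {}
    for fd_path in glob.glob(os.path.join(proc_root, "[0-9]*", "fd", "*")):
        try:
            target = os.readlink(fd_path)
        except OSError:
            continue  # fd closed in the meantime, or no permission
        if not target.startswith(prefix):
            continue
        pid = int(fd_path.split(os.sep)[-3])
        try:
            with open(os.path.join(proc_root, str(pid), "stat")) as f:
                holders[pid] = parse_stat(f.read())
        except OSError:
            continue  # process exited between the two reads
    return holders


if __name__ == "__main__":
    for pid, (comm, state) in sorted(nvidia_holders().items()):
        print(pid, comm, state)
```

Processes that keep appearing here after the training script has exited, especially with state D, would be the ones to look at.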