Dataloader threads persist if loading fails

I have long faced this problem, but never investigated it until now.

It seems that processes with sequential PIDs are spawned and persist. They don't show up in nvidia-smi, and not in top either, but if I run `fuser /dev/nvidia*`, processes with sequential PIDs show up.

The number of processes is equal to num_workers in the PyTorch data loader.

It becomes a big problem because every once in a while these processes end up in a sleep state, either S ('interruptible sleep') or D ('uninterruptible sleep'), and in the uninterruptible case we have to reboot our system. Which is a BIG problem for research clusters.
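For diagnosing this, you can read a process's state letter straight out of `/proc` rather than eyeballing `top`. A minimal sketch (Linux only; the helper name is mine):

```python
import os

def proc_state(pid):
    # Field 3 of /proc/<pid>/stat is the state letter:
    # R running, S interruptible sleep, D uninterruptible sleep, Z zombie.
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The comm field can contain spaces and parentheses,
    # so split on the last ')' before taking the state field.
    return data.rsplit(")", 1)[1].split()[0]

print(proc_state(os.getpid()))  # typically 'R' for the calling process
```

Running this over the PIDs reported by `fuser /dev/nvidia*` tells you whether the stuck workers are in S (killable) or D (reboot territory).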

Any leads?

https://github.com/pytorch/pytorch/pull/11985 should fix this. However, you can always kill -9 those without needing to reboot.

The patch is merged! If you download the nightly later today, it should include the patch.

kill -9 doesn't work once they've gone into uninterruptible sleep. SIGKILL can terminate a process in the S ('interruptible sleep') state, but not one in the D ('uninterruptible sleep') or Z ('zombie') states (look up Linux process states).
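The S-state case is easy to check for yourself: a child blocked in a plain sleep is in interruptible sleep, and SIGKILL takes it down immediately. A small POSIX-only sketch:

```python
import os
import signal
import subprocess
import time

# A child blocked in `sleep` sits in the S (interruptible sleep) state.
child = subprocess.Popen(["sleep", "60"])
time.sleep(0.2)                      # give it time to enter the sleep
os.kill(child.pid, signal.SIGKILL)   # delivered: S-state processes are killable
rc = child.wait()
print(rc)  # -9 -> terminated by SIGKILL
```

A process stuck in D, by contrast, only leaves that state when the kernel operation it is blocked on (often disk or driver I/O) completes, which is why a wedged NVIDIA ioctl can force a reboot.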

I’ll update though and see if it happens again!

You are absolutely right. Although I am curious: why are your dataloader workers using CUDA?