Dataloader threads persist if loading fails

SpandanMadan · October 9, 2018, 4:20am

I have long faced this problem, but never investigated it until now.

It seems that processes with sequential PIDs get spawned and then persist. They show up in neither nvidia-smi nor top, but if I run fuser /dev/nvidia*, processes with sequential PIDs show up.
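For anyone who wants to do the same check from a script, here is a rough sketch of a fuser-like scan. It assumes a Linux /proc filesystem and only looks at open file descriptors (fuser also checks memory maps and a few other things, so results can differ). The function name and the proc_root parameter are made up for illustration.

```python
import os

def pids_holding_device(prefix="/dev/nvidia", proc_root="/proc"):
    """Return sorted PIDs that have an open file descriptor whose
    target path starts with `prefix`. Hypothetical helper, Linux only."""
    found = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if target.startswith(prefix):
                found.add(int(entry))
                break
    return sorted(found)
```

Calling pids_holding_device() without root privileges will only report processes owned by the current user, since other users' fd directories are not readable.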

The number of processes is equal to num_workers in the PyTorch data loader.

It becomes a big problem because every once in a while these processes end up in the S state (‘interruptible sleep’) or the D state (‘uninterruptible sleep’), at which point we have to reboot our system. That is a BIG problem for research clusters.

Any leads?

SimonW · October 9, 2018, 3:16pm

https://github.com/pytorch/pytorch/pull/11985 should fix this. However, you can always kill -9 those without needing to reboot.
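If you would rather do the kill -9 from Python, something like the following should work on a POSIX system. This is only a sketch: kill_pids is a made-up helper name, and the PIDs you pass in are whatever you collected yourself (for example from fuser).

```python
import os
import signal

def kill_pids(pids):
    """Send SIGKILL to each PID and return the PIDs that were signalled.
    Hypothetical helper; PIDs that are gone or not ours are skipped."""
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            continue  # the process has already exited
        except PermissionError:
            continue  # the process belongs to another user
        signalled.append(pid)
    return signalled
```

Note that a returned PID only means the signal was sent, not that the process has actually terminated.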

SimonW · October 9, 2018, 5:15pm

The patch is merged! If you download the nightly later today, it should have the patch.

SpandanMadan · October 9, 2018, 5:57pm

kill -9 doesn’t kill them if they’ve gone into the uninterruptible sleep state. It can only kill processes that are not in the D or Z states (look up Linux process states).
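For reference, the state letter of a process can be read from /proc/<pid>/stat on Linux. Below is a small sketch, with made-up helper names, that pulls out the state field. The command name in that file is wrapped in parentheses and may itself contain spaces or parentheses, so the sketch splits on the last closing parenthesis.

```python
import os

# Common Linux process state letters as reported in /proc/<pid>/stat.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "Z": "zombie",
    "T": "stopped",
}

def parse_stat_state(stat_line):
    """Extract the one-letter state from a /proc/<pid>/stat line."""
    # Layout is "pid (comm) state ..."; comm may contain ')' or spaces,
    # so take everything after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

def process_state(pid, proc_root="/proc"):
    """Return the state letter for `pid`, or None if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as f:
            return parse_stat_state(f.read())
    except OSError:
        return None  # process is gone or /proc is unavailable
```

A process reported as D is blocked inside the kernel, and signals (SIGKILL included) are not acted on until that kernel call returns. A process reported as Z has already exited and is only waiting for its parent to reap it.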

I’ll update though and see if it happens again!

SimonW · October 10, 2018, 5:08am

You are absolutely right. I am curious, though: why are your dataloader workers using CUDA?