PyTorch causes zombie processes on multi-GPU system


I’m running my PyTorch code on a GPU on a remote server. The server has 8 GPUs shared among multiple users, so I run my code on one of the GPUs using the DataParallel class. Whenever I try to terminate the process, it turns into a zombie process, as you can see in the nvidia-smi output here:
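As a side note on running on a single GPU of a shared machine: one common approach is to restrict which devices the process can see before any CUDA library loads, so DataParallel has nothing extra to spawn onto. A minimal stdlib sketch (the GPU index "3" is illustrative, not from the original post):

```python
import os

# Pin this process to one GPU *before* any CUDA-aware library is imported;
# the index "3" is illustrative -- use whichever GPU is free on the server.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

# From here on, CUDA-aware libraries (PyTorch included) see a single
# device numbered 0, so DataParallel cannot spread onto other GPUs.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 3
```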

All processes shown with a dash as the process name are killed processes that are still occupying the GPU and can’t be removed unless the system is rebooted, which affects all users of the system.
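For context on the terminology: a true zombie is a child process that has exited but whose parent has not yet called wait() on it, so its entry lingers in the process table. The stuck GPU processes described above behave similarly when workers are killed and never reaped. A minimal stdlib sketch (no PyTorch involved) showing that reaping a dead child clears the zombie:

```python
import os
import time

# Fork a child that exits immediately; until the parent calls
# waitpid(), the child lingers as a zombie (<defunct> in ps output).
pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits right away

time.sleep(0.2)  # give the child time to exit and become a zombie

# Reaping the child with waitpid() removes the zombie entry
# from the process table.
reaped_pid, status = os.waitpid(pid, 0)
print(reaped_pid == pid)  # → True once the zombie is reaped
```

This is POSIX-specific (os.fork is unavailable on Windows), which matches the Linux servers nvidia-smi runs on.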

I can’t reproduce this problem on my machine, which has only 1 GPU; there the code runs normally and terminates properly. The problem first appeared on the multi-GPU system just last week, and I’m not sure whether it was caused by a change in the PyTorch library or is a problem on my side.

If you could suggest a fix or look into what the problem might be, that would be great.

Kind regards,

Running into the same issue here. PyTorch creates deadlocking processes (I debugged into it; let me know if you want details on where the deadlock happens – it seems to be a threading issue, in my opinion), which become zombies and use up a lot of GPU RAM while at the same time blocking any further or current GPU jobs. I will try to add the details (PyTorch version, CUDA version, etc.) later.
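A workaround some people use when a training run wedges like this is to kill the whole process group rather than a single PID, so DataLoader workers and other children die together with the main process instead of being orphaned on the GPU. A hedged stdlib sketch (the sleeping child stands in for a real training script):

```python
import os
import signal
import subprocess
import sys
import time

# Launch the job in its own process group (start_new_session=True),
# so the main process and any workers it forks share one group id.
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],  # stand-in for a train script
    start_new_session=True,
)

time.sleep(0.2)

# Signalling the whole group takes down the parent *and* every worker
# it spawned, instead of leaving orphans holding GPU memory.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
proc.wait()
print(proc.returncode)  # negative signal number, e.g. -15 for SIGTERM
```

If SIGTERM is ignored by a deadlocked process, the same call with signal.SIGKILL is the last resort; note this will not free GPU memory held by a process stuck in an uninterruptible kernel call, which matches the reboot-only symptom described above.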