Leaving unknown error on CUDA


I was running a task on certain GPUs, then started another task on the same GPUs. When I pressed Ctrl+C to terminate the second task, everything suddenly got stuck. When I ran nvidia-smi to check the situation, it also hung in the dreaded uninterruptible state. When I tried to kill every related process, the task seemed to terminate, but some Python processes could not be killed. They left CUDA in an "unknown error" state, still occupying memory and driving those GPUs to 100% utilization, so I could not submit any more jobs to them. I have encountered this twice. Is this disaster caused by something I did wrong, or is it a PyTorch bug?

BTW, I built PyTorch from the latest commit.


I also encountered the same problem and do not know how to solve it. I do not want to reboot the server because I am not a sudoer and other people are using it. Related issues for TensorFlow and MXNet are posted at https://github.com/dmlc/mxnet/issues/4242

Exactly. I was also using a shared server, and luckily I was able to ask the DCO to reboot it, which solved the problem.
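
For anyone hitting this later, here is a minimal diagnostic sketch (assuming Linux; the `/dev/nvidia*` device-node paths and the availability of `fuser` are assumptions about your setup, not something confirmed in this thread). The key symptom described above is a process in uninterruptible sleep ("D" state): such a process is blocked inside a kernel driver call and will not respond to SIGKILL until that call returns, which is why a reboot (or reloading the NVIDIA kernel module) is often the only thing that clears it.

```shell
# If fuser is installed, list processes still holding the NVIDIA device
# nodes (harmless no-op on machines without /dev/nvidia*).
command -v fuser >/dev/null 2>&1 && fuser -v /dev/nvidia* 2>/dev/null || true

# Check the state of a suspect process. A "D" in the STAT column means
# uninterruptible sleep: the process is blocked in the kernel and will
# ignore SIGKILL until the driver call returns. ($$ is used here only so
# the line is runnable; substitute the stuck PID.)
ps -o pid=,stat=,comm= -p "$$"
```

If every process listed by `fuser` is in "D" state, killing them from userspace will not work, which matches the reboot outcome reported above.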