Hello,
I was running a task on certain GPUs, and then started a second task on the same GPUs. When I used Ctrl+C to terminate the second task, everything suddenly got stuck. When I ran nvidia-smi to check the situation, it also hung in the dreaded uninterruptible state. I then tried to kill every related process; the tasks appeared to terminate, but some python processes could not be killed, and the GPUs were left in an unknown CUDA error state: their memory stayed occupied, utilization was pinned at 100%, and I could not submit any more jobs to them. This has happened to me twice. Is this disaster caused by something I did wrong, or is it a PyTorch bug?
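For reference, next time this happens, one way to confirm whether the stuck python processes are in uninterruptible sleep (D state, which even `kill -9` cannot terminate because the process is blocked inside a kernel/driver call) is something like the following sketch; the exact column widths and process names are just examples:

```shell
# Show the header plus any processes stuck in uninterruptible sleep
# (STAT starting with "D"). These are typically blocked inside a
# kernel or driver call, e.g. the NVIDIA driver, and cannot be
# killed until that call returns.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'

# Optionally narrow to python processes and inspect what they are
# blocked on: the WCHAN column shows the kernel function each
# process is sleeping in.
ps -eo pid,stat,wchan:32,comm | grep -E 'STAT|python'
```

If the processes really are in D state waiting on the driver, killing them from user space will not work, which matches the behavior described above.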
BTW, I built PyTorch from the latest commit.