During distributed training, the training process hangs with no clue as to why. GPU memory and GPU utilization (90%-100%) look normal on all GPUs, and no PID is killed!
In my case, the GPUs show 100% utilization, and some CPU cores are busy as well, but strangely nothing is actually making progress and the training program just seems "dead". We use this repo: TorchCV.
I tried to stop the program with Ctrl+C and also by killing the process. I was expecting some error info, but nothing was printed.
Has anyone encountered this kind of problem before? It would also be great if anybody has techniques to localize the problem, or at least to get the process to print something.