I use a pretrained DenseNet121 to test several datasets, but after running for several epochs the program hangs without reporting any error message. The process can still be found on the host and the GPU memory is still occupied by it, but the Volatile GPU-Util is 0%, as the following shows:
Are you using multiprocessing? If so, could you disable it (e.g. by setting num_workers=0 in DataLoader) just for debugging?
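To make the suggestion concrete, here is a minimal sketch of single-process loading (assuming `torch` is installed; the dataset is a dummy stand-in, not the poster's data):

```python
# Minimal sketch: force single-process data loading so that a hang
# cannot be caused by DataLoader worker subprocesses.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3))  # tiny dummy dataset

# num_workers=0 (the default) loads batches in the main process --
# no worker subprocesses, so multiprocessing is ruled out for debugging.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

for (batch,) in loader:
    pass  # run the model on `batch` here
```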
Is the process hanging in the first epoch or after a few?
Thanks for your kind reply! I just use the pretrained model to test some samples, and num_workers is 0 by default in my DataLoader, as the code shows. It hangs after several samples.
Hi, I kept my program running on CPU for more than 30 hours and it didn't hang again. After that, I ran it on GPU again and it didn't hang either. One detail worth mentioning: shortly after I posted this question, my machine restarted automatically, which I think helped resolve the problem.
I also occasionally get a runtime error when I use a shell script to keep running a group of PyTorch-based programs. The error is as follows:
cuda runtime error (4) : unspecified launch failure at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:257
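For what it's worth, asynchronous CUDA errors like this one are often reported far from the kernel that actually failed; a common way to localize them is to rerun with blocking launches. A sketch (the script name is a hypothetical placeholder):

```shell
# Run the job with synchronous kernel launches so the stack trace
# points at the failing kernel. Much slower -- for debugging only.
CUDA_LAUNCH_BLOCKING=1 python train.py  # "train.py" is a placeholder
```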
But when this error appears, the program just shuts down rather than hanging. It may be a separate problem; I only mention it to give more information about the hang, since there might be some latent relation between the two.