The program hangs after running for some time

I'm using a pretrained DenseNet121 to test several datasets, but after running for several epochs the program hangs without reporting any error message. The process can still be found on the host and the GPU memory is still occupied by it, but the Volatile GPU-Util reported by nvidia-smi is 0%.

What's the problem? Can anyone help shed light on this?

The code:

    from torch.utils.data import DataLoader

    for batch_no in range(9, 24):
        # raw_test_data, random_samples, start and batch_size come from earlier
        # in the script; datasetMutiIndx and exclude_wrong_labeled are
        # project-specific helpers.
        test_data = datasetMutiIndx(raw_test_data, random_samples[start:start + batch_size])
        complete_data = exclude_wrong_labeled(model, test_data, device)
        test_data_loader = DataLoader(dataset=complete_data, batch_size=1, shuffle=True)
        deepfool = DeepFool(target_model=model, num_classes=num_out, overshoot=overshoot,
                            max_iter=max_iter, device=device)
        for data, label in test_data_loader:
            data = data.squeeze(0)  # drop the extra batch dimension
            data, label = data.to(device), label.to(device)
            adv_img, normal_label, adv_label = deepfool.do_craft(data)

Are you using multiprocessing? If so, could you disable it (e.g. by setting num_workers=0 in DataLoader) just for debugging?
Is the process hanging in the first epoch or after a few?
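
For example, a minimal sketch (reusing the `complete_data` dataset from your snippet):

    # Run the loader in the main process (no worker processes) to rule
    # out multiprocessing deadlocks while debugging.
    test_data_loader = DataLoader(dataset=complete_data,
                                  batch_size=1,
                                  shuffle=True,
                                  num_workers=0)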

Thanks for your kind reply! I just use the pretrained model to test some samples, and num_workers is 0 by default in my DataLoader, as the code shows. It hangs after several samples.

Thanks for the info. Could you try to run it on the CPU and see if it hangs again?
Are you using multiprocessing somewhere else?

Could you post an executable code snippet reproducing the error?
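
For reference, a minimal sketch of switching the run to the CPU (assuming `device` is the only place the GPU is selected in your script):

    import torch

    # Use the CPU instead of the GPU; the model and every tensor moved
    # via .to(device) will then live in host memory.
    device = torch.device("cpu")
    model = model.to(device)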

Good idea. I'll test it on the CPU, though it may take a long time since the model is complicated. When I get a result, I'll reply immediately!

Hi, I've kept my program running on the CPU for more than 30 hours, and it didn't hang. After that, I ran it on the GPU again, and it didn't hang either. A detail I'd like to mention: shortly after I posted this question, my machine automatically restarted, which I think helped to solve this problem.

I also occasionally get a runtime error when I use a shell script to keep running a group of PyTorch-based programs. The error is as follows:

    cuda runtime error (4) : unspecified launch failure at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:257

But when the error appears, the program just shuts down rather than hanging. This may be a different problem; I just want to give more information about the hanging, since there might be some latent relation.
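
A sketch of one way to narrow such an error down (my own suggestion, not something confirmed in this thread): CUDA kernel launches are asynchronous, so the location the error is reported at can be misleading. Setting `CUDA_LAUNCH_BLOCKING=1` makes launches synchronous, so the traceback should point at the actual failing call:

    import os

    # Must be set before torch initializes CUDA; kernel launches then run
    # synchronously, so errors surface at the real call site.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch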