I am having a weird issue where, in the middle of an epoch, training makes no progress and hangs; even the elapsed-time counter halts. My GPU temperature drops, but the nvidia-smi output still shows the model sitting on the GPU, since the memory usage (about 3 GB for a ResNet-32 model) remains the same.
And surprisingly, training resumes when I press any key.
Has anyone faced an issue like this?
I have tried changing num_workers in torch.utils.data.DataLoader many times, and none of it seems to make any difference.
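For anyone debugging a similar hang, a common first step is forcing single-process data loading with `num_workers=0`, which rules out worker-process deadlocks (a frequent cause of stalls, especially on Windows). A minimal sketch, guarded so it degrades gracefully when PyTorch is not installed; the toy dataset and `count_samples` helper are illustrative, not from the original post:

```python
def count_samples(num_workers: int = 0):
    """Iterate a toy DataLoader and count the samples seen.

    num_workers=0 keeps all loading in the main process, so if a hang
    disappears with this setting, worker processes were the likely cause.
    Returns None when PyTorch is not available (sketch only).
    """
    try:
        import torch
        from torch.utils.data import DataLoader, TensorDataset
    except ImportError:
        return None  # PyTorch not installed; nothing to demonstrate
    ds = TensorDataset(torch.arange(8).float())
    loader = DataLoader(ds, batch_size=4, num_workers=num_workers)
    return sum(batch[0].shape[0] for batch in loader)
```

If the hang only appears with `num_workers > 0`, the usual suspects are the dataset's `__getitem__` and the absence of an `if __name__ == "__main__":` guard on Windows.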
OK, I tested the above two steps and neither of them seems to work. I am guessing my CPU and GPU are throttling, which stalls the process and causes the hang?
Below is the nvidia-smi output when it is stalled:
Are you highlighting something in the CMD window, either by dragging in it or by using the right-click menu (Edit > Mark)? If so, the OS pauses the program's execution until you restore the window to its normal state by pressing any key or using Edit > Copy.
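This pause comes from the Windows console's QuickEdit mode, which can also be disabled programmatically via the Win32 console API so an accidental click can never freeze a long training run. A sketch using `ctypes` (the flag constants are from the Windows `SetConsoleMode` documentation; the function is a hypothetical helper, not part of the original thread):

```python
import sys
import ctypes

def disable_quickedit() -> bool:
    """Turn off QuickEdit mode on the current Windows console.

    QuickEdit pauses a running program whenever text is selected in the
    window. Returns True on success, False on non-Windows platforms or
    if the console-mode calls fail.
    """
    if sys.platform != "win32":
        return False
    ENABLE_QUICK_EDIT_MODE = 0x0040  # documented console input flag
    ENABLE_EXTENDED_FLAGS = 0x0080   # required when changing QuickEdit
    kernel32 = ctypes.windll.kernel32
    handle = kernel32.GetStdHandle(-10)  # STD_INPUT_HANDLE
    mode = ctypes.c_uint32()
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False
    new_mode = (mode.value & ~ENABLE_QUICK_EDIT_MODE) | ENABLE_EXTENDED_FLAGS
    return bool(kernel32.SetConsoleMode(handle, new_mode))
```

Calling this once at the start of a training script is a defensive option; running the script under a terminal without QuickEdit (or redirecting output to a log file) achieves the same thing.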
No, I am not doing anything, actually. That nvidia-smi output was in a different command prompt. The behavior sometimes happens in the validation part and sometimes in the training part. What I observe is that it might not happen early in an epoch (which takes 25-30 min), but once it happens the first time, it keeps repeating very often.
I think I just fixed the problem after carefully analyzing the code. I had not inherited from torch.utils.data.Dataset in my custom dataset class. I am unsure how this could cause the hang, but it seems to have fixed the problem, at least for now; I haven't seen any random hangs in the middle since.
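For reference, the fix described above amounts to making the custom dataset subclass `torch.utils.data.Dataset` and implement `__len__` and `__getitem__`. A minimal sketch (the class name and toy data are illustrative; a stub base class is included only so the snippet runs where PyTorch is not installed):

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    class Dataset:  # fallback stub so the sketch runs without PyTorch
        pass

class MyDataset(Dataset):  # the inheritance is the point of the fix
    """Custom dataset exposing the map-style interface DataLoader expects."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        # DataLoader uses this to size its sampler
        return len(self.samples)

    def __getitem__(self, idx):
        # Called (possibly from worker processes) for each index
        return self.samples[idx]

ds = MyDataset([10, 20, 30])
```

Without the `Dataset` base class, a class that merely happens to define `__len__` and `__getitem__` may still work in simple cases, which is probably why the bug only surfaced intermittently.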