Model training hangs in the middle

I am having a weird issue where, in the middle of an epoch, training stops making progress and hangs; even the elapsed time seems to halt. My GPU temperature drops, but nvidia-smi still shows the model sitting on the GPU, since the memory usage (about 3 GB for a ResNet-32 model) remains the same.
Surprisingly, training resumes when I press any key.
Have you faced any issue like this?

I have tried changing num_workers in torch.utils.data.DataLoader quite a bit, and none of the values seem to make any difference.
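For context, a minimal sketch of the kind of DataLoader setup being varied here (the dataset, batch size, and worker counts are placeholders, not the original code):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":  # guard is required on Windows when num_workers > 0
    # Placeholder dataset; the real code uses a custom dataset class.
    dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                            torch.randint(0, 10, (1000,)))
    for workers in (0, 2, 4):  # values tried; 0 loads data in the main process
        loader = DataLoader(dataset, batch_size=64, shuffle=True,
                            num_workers=workers, pin_memory=True)
        images, labels = next(iter(loader))  # smoke test: fetch one batch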

Hi,

Are you sure you did not add an input() (or raw_input()) call to your Python code?
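For illustration, this is the kind of thing to look for; a stray input() anywhere in the loop blocks the process until a key is pressed, which matches the symptom (hypothetical snippet, not the actual training code):

# Hypothetical illustration only: a leftover input() makes a loop stall
# exactly like this, until a key is pressed in the console.
for step in range(3):
    print(f"training step {step}")
    input()  # execution blocks here until Enter is pressed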

Nope. I feel this is connected to this Windows issue: https://serverfault.com/questions/314194/why-does-windows-command-prompt-stalls-until-a-key-is-pressed-when-executing-lon. I have implemented the solution given there. I am also updating PyTorch to the latest version and will check whether this happens again.
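For reference, the workaround from that thread boils down to disabling the console's Quick Edit mode so that a stray click or text selection cannot suspend the process. A rough sketch of doing that from Python on Windows via ctypes, using the documented Win32 console-mode flags (an untested sketch, not necessarily the exact steps taken here):

import ctypes

# Win32 console-mode flags (Windows only)
ENABLE_QUICK_EDIT_MODE = 0x0040
ENABLE_EXTENDED_FLAGS = 0x0080
STD_INPUT_HANDLE = -10

kernel32 = ctypes.windll.kernel32
handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)

mode = ctypes.c_uint32()
kernel32.GetConsoleMode(handle, ctypes.byref(mode))

# Clear Quick Edit so selecting text in the console no longer pauses the script.
new_mode = (mode.value & ~ENABLE_QUICK_EDIT_MODE) | ENABLE_EXTENDED_FLAGS
kernel32.SetConsoleMode(handle, new_mode)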

I had encountered a similar issue while installing Python packages using pip, so I think this might be an issue with the OS.

Ok, I tested both of the above steps and neither seems to work. I am guessing my CPU and GPU are throttling, which then stalls the process and causes the hang?
Below is the nvidia-smi output when it's stalled:

| NVIDIA-SMI 436.48       Driver Version: 436.48       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070   WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   73C    P2    39W /  N/A |   3232MiB /  8192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8820    C+G   ...osoft.LockApp_cw5n1h2txyewy\LockApp.exe N/A      |
|    0      9412    C+G   Insufficient Permissions                   N/A      |
|    0      9784      C   ...\admin\Miniconda3\envs\torch\python.exe N/A      |
|    0     10956    C+G   C:\Windows\explorer.exe                    N/A      |
|    0     11460    C+G   ...6)\Google\Chrome\Application\chrome.exe N/A      |
|    0     12316    C+G   ...cal\Programs\Microsoft VS Code\Code.exe N/A      |
|    0     13632    C+G   ...hell.Experiences.TextInput.InputApp.exe N/A      |
|    0     16728    C+G   ...oftEdge_8wekyb3d8bbwe\MicrosoftEdge.exe N/A      |
|    0     17108    C+G   ...DIA GeForce Experience\NVIDIA Share.exe N/A      |
|    0     17696    C+G   Insufficient Permissions                   N/A      |
|    0     17884    C+G   ....95.0_x64__kzf8qxf38zg5c\Skype4Life.exe N/A      |
|    0     19344    C+G   ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0     19508    C+G   ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
|    0     21836    C+G   Insufficient Permissions                   N/A      |
|    0     24484    C+G   ...x64__8wekyb3d8bbwe\Microsoft.Photos.exe N/A      |
+-----------------------------------------------------------------------------+

and when I press any key:

| NVIDIA-SMI 436.48       Driver Version: 436.48       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070   WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   90C    P2    58W /  N/A |   3244MiB /  8192MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8820    C+G   ...osoft.LockApp_cw5n1h2txyewy\LockApp.exe N/A      |
|    0      9412    C+G   Insufficient Permissions                   N/A      |
|    0      9784      C   ...\admin\Miniconda3\envs\torch\python.exe N/A      |
|    0     10956    C+G   C:\Windows\explorer.exe                    N/A      |
|    0     11460    C+G   ...6)\Google\Chrome\Application\chrome.exe N/A      |
|    0     12316    C+G   ...cal\Programs\Microsoft VS Code\Code.exe N/A      |
|    0     13632    C+G   ...hell.Experiences.TextInput.InputApp.exe N/A      |
|    0     16728    C+G   ...oftEdge_8wekyb3d8bbwe\MicrosoftEdge.exe N/A      |
|    0     17108    C+G   ...DIA GeForce Experience\NVIDIA Share.exe N/A      |
|    0     17696    C+G   Insufficient Permissions                   N/A      |
|    0     17884    C+G   ....95.0_x64__kzf8qxf38zg5c\Skype4Life.exe N/A      |
|    0     19344    C+G   ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0     19508    C+G   ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
|    0     21836    C+G   Insufficient Permissions                   N/A      |
|    0     24484    C+G   ...x64__8wekyb3d8bbwe\Microsoft.Photos.exe N/A      |
+-----------------------------------------------------------------------------+

Are you highlighting something in the CMD window, either by dragging over it or by using Right Menu > Edit > Mark? If so, the OS will pause execution of the program until you return the window to its normal state, either by pressing any key or by using Right Menu > Edit > Copy.

No, I am not doing anything, actually. That nvidia-smi output was taken from another cmd prompt. This behavior sometimes happens during validation and sometimes during training. What I observe is that it may not happen early in an epoch (which takes 25-30 minutes), but once it has happened the first time it keeps repeating very often.

I think I just fixed the problem after carefully analyzing the code. My custom dataset class was not inheriting from torch.utils.data.Dataset. I am not sure how that would cause this problem, but it seems to have fixed it, at least for now; I haven't seen any random hangs in the middle since.
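For anyone hitting the same thing, a minimal sketch of what the fixed class looks like once it subclasses torch.utils.data.Dataset (field names and data here are placeholders, not the original code):

import torch
from torch.utils.data import Dataset, DataLoader

class MyImageDataset(Dataset):  # key point: inherit from torch.utils.data.Dataset
    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

# Example usage with random tensors standing in for real data.
ds = MyImageDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))
loader = DataLoader(ds, batch_size=16, shuffle=True, num_workers=0)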
