PyTorch training blocks for no apparent reason

I'm training a U-Net with PyTorch.
At first it worked well. Later I noticed the GPU was not fully utilized, so I followed the usual advice to load the data on the CPU and move it to the GPU inside the training loop, which let me enable the pin_memory option, and I also changed num_workers from 0 to 4.
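For reference, the data-loading part now looks roughly like the sketch below. The dataset, tensor shapes, and batch size are placeholders I made up for illustration; only the pin_memory / num_workers settings and the CPU-to-GPU transfer reflect what I actually changed.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: random images and masks kept on the CPU.
# In the real code this is my own Dataset that returns CPU tensors.
images = torch.randn(256, 3, 128, 128)
masks = torch.randint(0, 2, (256, 1, 128, 128)).float()
dataset = TensorDataset(images, masks)

train_loader = DataLoader(
    dataset,
    batch_size=8,          # placeholder batch size
    shuffle=True,
    num_workers=4,         # changed from 0 to 4
    pin_memory=True,       # enabled after keeping the loaded data on the CPU
)

device = torch.device("cuda")
for batch_images, batch_masks in train_loader:
    # Batches are pinned on the CPU and moved to the GPU inside the loop.
    batch_images = batch_images.to(device, non_blocking=True)
    batch_masks = batch_masks.to(device, non_blocking=True)
    # ... forward / backward pass of the U-Net goes here ...
```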
With this configuration, the training blocks after 60-70 epochs. It raises no exception and does not crash; it just seems to "block" for an unknown reason. This did not happen before I made the changes above.
This is the output of nvidia-smi while the training is blocked:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …     Off | 00000000:01:00.0  On |                  N/A |
| 41%   41C    P8    10W / 225W |   2098MiB /  7981MiB |     26%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5151      C   …/envs/pytorch/bin/python3       1429MiB |
+-----------------------------------------------------------------------------+

As you can see, GPU memory is still occupied by the model, but GPU utilization drops to 30% and lower, whereas it stays at 90%+ when training runs normally. The training loop also stops printing output for the current epoch. It looks like training is blocked inside that epoch, but I can't figure out what the problem is.
What kind of state is the program in here? How can I fix it?
Thanks to anyone who can help.