GPU hangs after some epochs during training

shrutishrestha · June 5, 2021, 9:42am

I am training a medical imaging model with ResNet50. It worked well before and trained for the full 300 epochs. For the past 2 days, however, the program stops training in the 1st epoch after 160 iterations and hangs without any error. The process does not get killed, and I have to reboot the server before I can run it again.

What might be the reason for this?

nvidia-smi also gives this error after the server hangs:
“Unable to determine the device handle for GPU 0000:65:00.0: GPU is lost. Reboot the system to recover this GPU”

ptrblck · June 6, 2021, 11:20pm

The hang is most likely caused by the lost GPU. You could check dmesg for Xid codes to see why the GPU was dropped. Xid 79, for example, could indicate a thermal issue, where the GPU is disabled before it can be damaged by overheating, or an insufficient PSU.
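If it helps, here is a rough sketch of how the Xid lines could be pulled out of the kernel log with Python. The names `find_xid_events` and `read_kernel_log` are made up for this example, and the sample line is illustrative only, not taken from the log above. The regex assumes the usual `NVRM: Xid (PCI:...): <code>, ...` layout, which may differ between driver versions.

```python
import re
import subprocess

# Assumed layout of an NVIDIA driver Xid line in the kernel log, e.g.:
#   NVRM: Xid (PCI:0000:65:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix is treated as optional.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)(?:, (.*))?"
)


def find_xid_events(log_text):
    """Return (pci_address, xid_code, detail) tuples found in kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            pci, code, detail = match.groups()
            events.append((pci, int(code), (detail or "").strip()))
    return events


def read_kernel_log():
    """Read the kernel ring buffer via dmesg (may require root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative sample line (hypothetical, not from the poster's machine).
SAMPLE = (
    "[12345.678901] NVRM: Xid (PCI:0000:65:00): 79, "
    "pid=1234, GPU has fallen off the bus."
)
```

After a hang and reboot, the ring buffer is usually cleared, so you would either call `find_xid_events(read_kernel_log())` while the machine is still responsive or feed the function the text of a persisted log from the previous boot. Each returned tuple gives the PCI address, the numeric Xid code, and the rest of the message.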