GPU hangs after some iterations during training

I am training a medical imaging model with ResNet50. It worked fine before and trained for the full 300 epochs. For the past 2 days, however, the program stops training in the 1st epoch after 160 iterations and hangs without any error. The process does not get killed, and I have to reboot the server before I can run it again.

What might be the reason for this?

nvidia-smi also gives this error after the server hangs:
“Unable to determine the device handle for GPU 0000:65:00.0: GPU is lost. Reboot the system to recover this GPU”

The hang is most likely caused by the lost GPU. You could check dmesg for Xid codes to see why the GPU was dropped. Xid 79 ("GPU has fallen off the bus"), for example, could indicate a thermal issue, where the GPU is disabled before it can be damaged by overheating, or an insufficient PSU.
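
If you want to script the dmesg check, here is a minimal sketch in Python. It assumes the NVIDIA driver logs Xid events in the kernel log in the form `NVRM: Xid (PCI:<address>): <code>, ...`. The helper names `find_xid_events` and `read_kernel_log` are made up for this example, and reading dmesg may require root on your system.

```python
import re
import subprocess

# Assumed shape of an NVIDIA Xid line in the kernel log:
#   NVRM: Xid (PCI:<address>): <code>, ...
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_device, xid_code) tuples found in log_text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(log_text)]


def read_kernel_log():
    """Return the output of `dmesg`, or an empty string if it cannot be read."""
    try:
        result = subprocess.run(
            ["dmesg"],
            capture_output=True,
            text=True,
            errors="replace",  # tolerate undecodable bytes in the log
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # dmesg is missing or not permitted for this user
        return ""
    return result.stdout


if __name__ == "__main__":
    events = find_xid_events(read_kernel_log())
    if not events:
        print("No Xid events found (or dmesg could not be read).")
    for device, code in events:
        print(f"{device}: Xid {code}")
```

Run it after the GPU is lost and before you reboot, since the in-memory kernel log is cleared by the reboot. Each reported code can then be looked up in NVIDIA's Xid error documentation.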