I am training a medical imaging model with ResNet50. It was working fine before and trained through all 300 epochs. But for the past 2 days, training stops in the 1st epoch after 160 iterations and hangs without any error. The process does not get killed. I then have to reboot the server before I can run it again.
What might be the reason for this?
nvidia-smi also gives this error after the server hangs:
“Unable to determine the device handle for GPU 0000:65:00.0: GPU is lost. Reboot the system to recover this GPU”
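For anyone trying to diagnose this, here is a minimal sketch of a logger that could run in a second terminal alongside training, so the last GPU readings before the hang survive the reboot. It only records data and does not fix anything. The field list, the log path `gpu_log.txt`, the 5-second interval, and the helper names (`parse_row`, `query_gpu`, `log_forever`) are assumptions chosen for illustration.

```python
import os
import subprocess
import time

# Fields requested from nvidia-smi --query-gpu (an illustrative selection).
FIELDS = ["timestamp", "temperature.gpu", "power.draw",
          "utilization.gpu", "memory.used"]


def parse_row(line):
    """Split one CSV row printed by nvidia-smi into a dict keyed by FIELDS."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def query_gpu(index=0):
    """Return (row, None) on success or (None, message) if the query fails."""
    cmd = ["nvidia-smi", f"--id={index}",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # nvidia-smi is missing, not executable, or did not answer in time.
        return None, str(exc)
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        # Covers messages such as "GPU is lost".
        return None, (result.stdout.strip() or result.stderr.strip())
    return parse_row(lines[0]), None


def log_forever(path="gpu_log.txt", interval_s=5):
    """Append one reading per interval; fsync so lines survive a hard reboot."""
    with open(path, "a") as f:
        while True:
            row, err = query_gpu()
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            f.write(f"{stamp} {row if row is not None else 'ERROR: ' + err}\n")
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` before starting training would leave a record of temperature, power draw, and memory use leading up to iteration 160, which could be posted along with the error above.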