I trained the PyTorch example resnet18 on ImageNet. After about 1 epoch the training hangs and nvidia-smi says the GPU is lost …
nvidia-smi -l 1 says:
Unable to determine the device handle for GPU 0000:09:00.0: GPU is lost. Reboot the system to recover this GPU