GPU lost while training on ImageNet

I was training the PyTorch example ResNet-18 on ImageNet. After about 1 epoch the training hangs, and nvidia-smi reports that the GPU is lost …

nvidia-smi -l 1 says:
Unable to determine the device handle for GPU 0000:09:00.0: GPU is lost. Reboot the system to recover this GPU

This is not specific to PyTorch. It looks like you have either a hardware issue or an NVIDIA driver issue. I suspect a hardware / thermal issue.
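One way to check the thermal suspicion is to log temperature, power draw, and SM clock in a second terminal while training runs, and see what the readings look like just before the GPU drops out. The following is a minimal sketch, not a definitive diagnostic: the helper names (`parse_gpu_line`, `read_gpus`, `monitor`) and the 85 °C warning threshold are hypothetical choices made for illustration, and it assumes `nvidia-smi` is on the PATH.

```python
import subprocess
import time

# Fields requested from nvidia-smi's CSV query interface.
QUERY = "index,temperature.gpu,power.draw,clocks.sm"


def parse_gpu_line(line):
    """Split one CSV line of nvidia-smi output into a field -> string dict."""
    fields = [f.strip() for f in line.split(",")]
    return dict(zip(QUERY.split(","), fields))


def read_gpus():
    """Query all GPUs once; raises CalledProcessError if nvidia-smi fails."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(l) for l in out.splitlines() if l.strip()]


def monitor(interval_s=1.0, warn_temp_c=85):
    """Print readings every interval_s seconds until nvidia-smi fails.

    warn_temp_c is an arbitrary illustrative threshold, not a vendor limit.
    """
    while True:
        try:
            gpus = read_gpus()
        except subprocess.CalledProcessError as err:
            # A lost GPU typically makes nvidia-smi exit with an error.
            print(time.strftime("%H:%M:%S"), "nvidia-smi failed:", err.stderr or err)
            break
        for gpu in gpus:
            flag = ""
            try:
                if float(gpu["temperature.gpu"]) >= warn_temp_c:
                    flag = "  <-- hot"
            except ValueError:
                pass  # value may be reported as "[N/A]" on some devices
            print(time.strftime("%H:%M:%S"), gpu, flag)
        time.sleep(interval_s)


# Usage (run alongside training): monitor()
```

If the last readings before the failure show high temperature or a sharp power spike, that points toward cooling or power supply; if they look normal, a driver or other hardware fault becomes more likely.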
