PyTorch 0.4.1 random reboot K80 on Linux

MPWARE · August 20, 2018, 6:17pm

Hi,

I’m facing an issue similar to one already reported here by another user. I’m running PyTorch to train a ResNet model on an Ubuntu 16 LTS VM hosted on Google Cloud (one K80 GPU). It works fine for a few minutes (several epochs) and then the host reboots: no warning, no error, just a crash followed by a reboot. It’s quite random; sometimes it runs for 30 minutes without any problem.

What could be done? Do I need to update to cuDNN 7.1.3 (I’m using 7.0.5)?

Python : 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
Numpy : 1.14.2
PyTorch : 0.4.1
torch.version.cuda = 9.0.176
torch.backends.cudnn.version = 7102
torch.cuda.device_count() = 1
torch.cuda.current_device() = 0
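
For anyone who wants to post the same details, here is a rough sketch of how a report like the one above could be generated. The helper names `decode_cudnn_version` and `collect_environment` are made up for illustration. The decoding assumes the four-digit cuDNN 7.x scheme (major * 1000 + minor * 100 + patch); under that assumption 7102 reads as 7.1.2, which is not the 7.0.5 mentioned above, so it may be worth checking which cuDNN build PyTorch is actually loading.

```python
import sys


def decode_cudnn_version(encoded):
    """Split an integer such as 7102 into (major, minor, patch).

    Assumption: the cuDNN 7.x encoding, major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(encoded, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def collect_environment():
    """Gather the fields listed in the post. Requires PyTorch and NumPy."""
    # Imported here so the decoder above can be used without these packages.
    import numpy
    import torch

    info = {
        "Python": sys.version.split()[0],
        "Numpy": numpy.__version__,
        "PyTorch": torch.__version__,
        "torch.version.cuda": torch.version.cuda,
        "torch.backends.cudnn.version": torch.backends.cudnn.version(),
        "torch.cuda.device_count()": torch.cuda.device_count(),
    }
    # current_device() is only queried when a GPU is actually visible.
    if torch.cuda.is_available():
        info["torch.cuda.current_device()"] = torch.cuda.current_device()
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key} = {value}")
```

Calling `decode_cudnn_version(torch.backends.cudnn.version())` on the machine in question would show the cuDNN version PyTorch reports, in a form that is easier to compare against the installed package.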

Thanks.

MPWARE · August 20, 2018, 8:55pm

Upgrading from cuDNN 7.0.5 to cuDNN 7.1.3 appears to fix the issue.