GPU failure if training on the primary GPU?

ming · February 25, 2017, 1:15am

Hi,

We have a machine with 4 Titan X GPUs, but we can only train PyTorch models on the 3 GPUs other than the primary one. Whenever we use the primary GPU, PyTorch hangs after a few iterations and nvidia-smi reports the GPU as lost afterwards; only a reboot recovers the machine.

We have tried uninstalling Xorg from Ubuntu 16 desktop, reinstalling Ubuntu 16 server without X, disconnecting the display, etc. Training that uses the primary GPU always causes GPU loss.

We also tried training with MXNet on all 4 GPUs, and it runs fine without this problem.

Any idea why PyTorch cannot train on the primary GPU? Is there any workaround we could try?

Thanks!
Ming

smth · February 25, 2017, 3:28am

Hi Ming,

Try updating your NVIDIA drivers to the latest version.
This is definitely an issue related either to the NVIDIA driver or to the hardware (overheating or some other fault).
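
To check the driver version and temperatures mentioned above, one option is to poll nvidia-smi from a small script while training runs. The following is only a rough sketch: the helper names `parse_gpu_rows` and `query_gpus` are made up for illustration, and it assumes nvidia-smi is on the PATH and supports the `--query-gpu` fields listed.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; the order here defines the column order.
QUERY_FIELDS = [
    "index",
    "name",
    "driver_version",
    "temperature.gpu",
    "memory.used",
    "memory.total",
]


def parse_gpu_rows(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        values = [value.strip() for value in record]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus(timeout_seconds=30):
    """Return one dict per GPU, or an empty list if the query cannot be run."""
    if shutil.which("nvidia-smi") is None:
        return []
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except (OSError, subprocess.TimeoutExpired):
        # nvidia-smi itself may hang or fail once a GPU has been lost.
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Running this in a loop alongside training would show whether the primary GPU's temperature or memory usage climbs just before the hang, which may help separate a thermal problem from a driver problem.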

"We also tried training with MXNet on all 4 GPUs, and it runs fine without this problem."

This could be because your MXNet install did not end up using cuDNN or NCCL (for example), and so it runs fine.

I would also suggest that you try installing PyTorch from source or from the Docker image, just to see if that helps your case. It can also help me debug your case: pytorch/pytorch on GitHub (Tensors and Dynamic neural networks in Python with strong GPU acceleration)

ming · February 25, 2017, 5:18am

Thank you Soumith for the prompt reply!

We are using NVIDIA driver 375.26, CUDA 8.0 and cuDNN 5.1, which are the latest. I doubt it’s an overheating issue, because we can train the model using GPUs 1, 2, 3 but not GPUs 0, 1, 2, where GPU 0 is the primary one.
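
For reference, restricting a training process to GPUs 1, 2, 3 can be done with the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below is illustrative only: the helper names `visible_devices` and `restrict_gpus` are hypothetical, and it assumes GPU 0 is the primary GPU and that the variable is set before the framework initializes CUDA.

```python
import os


def visible_devices(total_gpus, exclude=()):
    """Build a CUDA_VISIBLE_DEVICES value with every index except `exclude`."""
    excluded = set(exclude)
    return ",".join(str(i) for i in range(total_gpus) if i not in excluded)


def restrict_gpus(total_gpus, exclude=()):
    """Limit this process to the non-excluded GPUs.

    Must run before the deep learning framework initializes CUDA,
    otherwise the setting has no effect on the current process.
    """
    # Order devices by PCI bus ID so the indices match what nvidia-smi shows,
    # unless the caller has already chosen an ordering.
    os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")
    value = visible_devices(total_gpus, exclude)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Assumption: a 4-GPU machine where index 0 is the primary GPU.
    print(restrict_gpus(4, exclude=[0]))
```

Inside the restricted process the remaining GPUs are renumbered from 0, so physical GPU 1 appears as device 0 to the framework.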

We have also seen cases where, on a machine with a single GTX 1080 GPU, training a PyTorch model that maximizes GPU memory usage causes GPU failure too.

In any case, I will try installing PyTorch from source and report back the result.

phenom · April 11, 2017, 6:09am

Hi Ming,

Were you able to solve the problem? I am also stuck on the same problem…