We have a machine with 4 Titan X GPUs, but we can only train PyTorch models on the 3 GPUs other than the primary one. Whenever training uses the primary GPU, PyTorch hangs after a few iterations and nvidia-smi reports the GPU as lost afterwards; only a reboot recovers the machine.
We have tried uninstalling X.Org from Ubuntu 16 desktop, reinstalling Ubuntu 16 server without X, disconnecting the display, and so on. Training that uses the primary GPU still always ends in GPU loss.
We also tried training with MXNet on all 4 GPUs, and that works without this problem.
Any idea why PyTorch cannot train on the primary GPU? Is there any workaround we could try?
Try updating your NVIDIA drivers to the latest version.
This is definitely an issue related either to the NVIDIA driver or to the hardware (overheating or some other problem).
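To help tell a driver problem from an overheating problem, it may be worth logging the driver version and per-GPU temperature while training runs. The following is only a rough sketch: the helper names `parse_gpu_status` and `query_gpu_status` are made up for illustration, and it assumes `nvidia-smi` is on the PATH and supports the `--query-gpu` fields shown.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "index,name,temperature.gpu,driver_version"


def parse_gpu_status(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue  # skip lines that do not match the requested fields
        index, name, temp, driver = fields
        rows.append({
            "index": int(index),
            "name": name,
            # A failing GPU may report an error string instead of a number.
            "temperature_c": int(temp) if temp.isdigit() else None,
            "driver": driver,
        })
    return rows


def query_gpu_status():
    """Run nvidia-smi; return an empty list if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_status(result.stdout)


for gpu in query_gpu_status():
    print(gpu)
```

Running this in a loop next to the training job would show whether GPU 0 gets noticeably hotter than the others before it drops out.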
We also tried training with MXNet on all 4 GPUs, and that works without this problem.
This could be because your MXNet install did not end up using cuDNN or NCCL (for example), and so runs fine.
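One way to test that guess from the PyTorch side is to turn cuDNN off and see whether the hang still happens. This is just a sketch, not a confirmed fix; the function name `report_and_disable_cudnn` is made up, and the import is guarded so the snippet does nothing harmful where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # PyTorch not installed in this environment
    torch = None


def report_and_disable_cudnn():
    """Print basic CUDA/cuDNN info, then disable cuDNN for this process."""
    if torch is None:
        print("PyTorch is not installed in this environment")
        return False
    print("CUDA available:", torch.cuda.is_available())
    print("cuDNN version:", torch.backends.cudnn.version())
    # With this flag off, PyTorch falls back to its non-cuDNN kernels.
    torch.backends.cudnn.enabled = False
    return True


report_and_disable_cudnn()
```

If training on the primary GPU survives with cuDNN disabled, that points toward the library or driver combination rather than the hardware.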
We are using NVIDIA driver 375.26, CUDA 8.0 and cuDNN 5.1, which are the latest versions. I doubt it's an overheating issue, because we can train the model on GPUs 1, 2, 3 but not on GPUs 0, 1, 2, where GPU 0 is the primary one.
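For anyone wanting to reproduce the GPUs 1, 2, 3 setup, one common way to hide the primary GPU from a process is the `CUDA_VISIBLE_DEVICES` environment variable. This is a minimal sketch and not necessarily how it was done here; `visible_devices_excluding` is a made-up helper. Note that CUDA does not order devices by PCI bus ID by default, so its GPU 0 may not be nvidia-smi's GPU 0 unless `CUDA_DEVICE_ORDER` is set as below.

```python
import os


def visible_devices_excluding(total_gpus, excluded):
    """Build a CUDA_VISIBLE_DEVICES value that skips the excluded indices."""
    return ",".join(str(i) for i in range(total_gpus) if i not in excluded)


# Both variables must be set before the process initializes CUDA,
# i.e. before the first CUDA call made through PyTorch.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # match nvidia-smi numbering
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices_excluding(4, {0})

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Inside the process the three remaining GPUs are then renumbered 0, 1, 2, so the training script itself does not need to know which physical card was skipped.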
We have also seen cases on a machine with a single GTX 1080 GPU where training a PyTorch model that maximizes GPU memory usage causes a GPU failure too.
In any case, I will try installing PyTorch from source and report back with the result.