When I train an FCN on multiple GPUs, after a few tens of iterations `nvidia-smi` reports "GPU is lost". What puzzles me is that this problem never appeared when I used TensorFlow or Caffe on the same GPU server. Also, the GPU is only lost when the batch size is 16; with a smaller batch size, e.g. 4, the problem does not occur.
So I think this is a bug in PyTorch rather than a problem with my GPUs.
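For reference, here is a minimal sketch of the kind of setup I mean (the model, data, and sizes below are placeholders, not my actual FCN code; the multi-GPU wrapping and the batch size are the relevant parts):

```python
import torch
import torch.nn as nn

# Placeholder FCN-style model: a couple of conv layers with a
# per-pixel classification head (my real model is a full FCN).
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 21, 1),  # e.g. 21 segmentation classes
)
model = nn.DataParallel(model).cuda()  # split each batch across all visible GPUs

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

batch_size = 16  # the crash appears at 16; batch_size = 4 runs fine
for step in range(1000):
    # Random tensors stand in for my real dataset.
    inputs = torch.randn(batch_size, 3, 256, 256).cuda()
    targets = torch.randint(0, 21, (batch_size, 256, 256)).cuda()

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # After a few tens of steps, nvidia-smi reports "GPU is lost".
```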