GPU is lost when training FCN

When I train an FCN on multiple GPUs, after a few tens of iterations nvidia-smi reports “GPU is lost”. What puzzles me is that this problem never appeared when I used TensorFlow or Caffe on the same GPU server. Also, the GPU is only lost when the batch size is 16; with a smaller batch size, for example 4, the problem does not happen.

So I think it’s some bug in PyTorch rather than a problem with my GPUs.
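For now the only workaround on my side is to stay at a per-step batch of 4 and accumulate gradients to get back to an effective batch size of 16. A minimal sketch of that idea, with a placeholder Conv2d model, random data, and optimizer instead of my actual FCN training code:

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the real FCN, loss, optimizer, and data pipeline.
model = nn.Conv2d(3, 21, kernel_size=1).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

micro_batch = 4   # per-step batch size that stays stable
accum_steps = 4   # 4 micro-batches of 4 -> effective batch size 16

model.train()
optimizer.zero_grad()
for step in range(100):
    images = torch.randn(micro_batch, 3, 256, 256, device='cuda')
    targets = torch.randint(0, 21, (micro_batch, 256, 256), device='cuda')
    # Scale the loss so the accumulated gradient matches a single batch of 16.
    loss = criterion(model(images), targets) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```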

Hi, did you solve the issue? I get the same problem when I’m using 4 GPUs at once.
Both two models using 2 GPUs each and one model using all 4 GPUs cause “GPU is lost”.
It’s very tricky. Everything is fine when I’m using only 2 GPUs at once.
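My workaround for now is to pin each job to two GPUs only, roughly like the sketch below (placeholder model, not my real training code; setting CUDA_VISIBLE_DEVICES before launching works just as well):

```python
import torch
import torch.nn as nn

# Placeholder model; the point is only restricting DataParallel to two GPUs.
model = nn.DataParallel(nn.Conv2d(3, 21, kernel_size=1), device_ids=[0, 1]).cuda()

images = torch.randn(8, 3, 256, 256).cuda()  # batch is split across GPU 0 and GPU 1 only
out = model(images)
print(out.shape)
```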

Please share your experience on how you dealt with this problem.