Do GPUs die? RTX 2080Ti going nuts on simple computations

I have two GPUs in my machine, an RTX 2080Ti and a 1080Ti. Last night, during an experiment on the RTX 2080Ti, the computation failed (as if it had hit a NaN), and I’ve been having weird issues with the RTX 2080Ti ever since.

My syslog is filled with:

[ 1653.030199] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 4, SM 1): Illegal Instruction Encoding
[ 1653.030203] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 5, TPC 4, SM 1): Multiple Warp Errors

The same model with the same input yields different outputs depending on the GPU! Even for a simple convolution, the outputs can differ by as much as 1e37. It seems like the issue must be in the CUDA driver or the layers below it.

I’m currently running Driver Version: 410.78 and CUDA Version: 10.0 with PyTorch 1, tested with Python 2.7 and 3.7. Here is how I seed everything:

import random

import numpy as np
import torch

# seed everything and force deterministic cuDNN kernels
np.random.seed(args.seed)
random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.set_device(0)
torch.cuda.manual_seed(args.seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.enabled = True

# op = Conv2d(3, 128, kernel_size=(9, 9), stride=(1, 1), padding=(4, 4))
# op.weight.norm() = 6.5080
# op.bias.norm() = 0.4507
# x.norm() = 326.2348
x_cpu = x.to('cpu')
y_cpu = op.to('cpu')(x_cpu)

# cuda:0 is the RTX 2080Ti, cuda:1 is the 1080Ti
x_cuda0 = x.to('cuda:0')
y_cuda0 = op.to('cuda:0')(x_cuda0).cpu()

x_cuda1 = x.to('cuda:1')
y_cuda1 = op.to('cuda:1')(x_cuda1).cpu()

(y_cpu-y_cuda0).max()
> 1.5438e+20
(y_cpu-y_cuda1).max()
> 9.5367e-07
y_cpu.norm()
> 16999.3672
y_cuda0.norm()
> inf
y_cuda1.norm() 
> 16999.3672
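
In case anyone wants to run the same check on their own cards, here is a small self-contained version of the comparison. The Conv2d shape matches the one above, but the weights and the input are random (and the input shape is arbitrary), so the exact norms won’t match my numbers:

import torch
import torch.nn as nn

def compare_devices(op, x, devices=('cuda:0', 'cuda:1')):
    # Run the same op on the CPU and on each GPU and report the max
    # absolute difference against the CPU result. Anything far above
    # float32 noise (or an inf/nan) points at the card or driver,
    # not at the model.
    with torch.no_grad():
        y_ref = op.to('cpu')(x.to('cpu'))
        for dev in devices:
            y_dev = op.to(dev)(x.to(dev)).cpu()
            diff = (y_ref - y_dev).abs().max().item()
            print('{}: max abs diff vs CPU = {:.3e}, norm = {:.4f}'.format(
                dev, diff, y_dev.norm().item()))

op = nn.Conv2d(3, 128, kernel_size=9, stride=1, padding=4)
x = torch.randn(1, 3, 224, 224)  # arbitrary input size
compare_devices(op, x)

On a healthy card the differences should be tiny, like the ~1e-6 I get on cuda:1.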

Has anyone ever encountered a similar issue? Any ideas on what to do next?!

I just searched for your error message and came across some posts on other boards, and these points came up repeatedly:

  • reduce the clock in case you’ve overclocked your system
  • your PSU might not have enough power for the card (you can watch the power draw with the snippet below)
  • install the latest drivers
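
If it helps with the first two points, here is a quick sketch for logging temperature, clocks, and power draw from Python while you rerun your test (it just shells out to nvidia-smi, so it assumes nvidia-smi is on your PATH; the query fields are standard ones):

import subprocess
import time

# one CSV row per GPU with standard nvidia-smi query fields
FIELDS = 'index,name,temperature.gpu,clocks.sm,clocks.mem,power.draw'

def gpu_status():
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=' + FIELDS, '--format=csv,noheader'])
    return out.decode('utf-8').strip()

for _ in range(30):  # poll roughly every second for ~30s
    print(gpu_status())
    time.sleep(1.0)

If the clocks or the power draw look odd right when the Xid errors show up, that would point at the PSU or an aggressive factory overclock.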

I don’t think this error is PyTorch related. Sorry for not being able to give more input, but I think you might want to post this issue on an NVIDIA support board.


Thank you for looking into this. I did search a lot before posting here. The closest thing I found was that some of the early RTX 2080Ti units were defective, but in almost all of those cases overheating had led to the error. It is definitely not PyTorch; most likely the driver or the hardware below it. I figured that if anyone is going to run into these issues, now or in the future, it’s this community.

In my situation, these errors don’t happen in the midst of an experiment when the card is operating at max capacity. It doesn’t heat up; I can’t even get past the first iteration of training on the RTX 2080Ti anymore. The PSU is 1000W, which I believe is more than enough. There is no manual overclocking either, though I’m not sure whether the manufacturer did any factory overclocking.

I’ll update the driver to 415, which is currently a beta, to see if it fixes the issue, and double-check with the snippet below what PyTorch and the driver actually report afterwards. I’ll post here if I have any luck.
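
Nothing fancy, just printing the versions PyTorch sees plus the driver version reported by nvidia-smi (driver_version is a standard query field):

import subprocess

import torch

# what PyTorch was built against vs. what the system reports
print('torch {}, CUDA {}, cuDNN {}'.format(
    torch.__version__, torch.version.cuda, torch.backends.cudnn.version()))
print('device 0: {}'.format(torch.cuda.get_device_name(0)))
driver = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'])
print('driver: {}'.format(driver.decode('utf-8').strip()))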

Thanks.

Hi, yeah, we encountered the same issue a few days ago, where one of our two RTX 2080Tis was blowing up the activations and the gradients for no apparent reason, resulting in inf and nan all over the place. The other card works just fine. Here is the thread we opened: https://discuss.pytorch.org/t/different-losses-on-2-different-machines/.
Thank you

We ended up sending the GPU back to the manufacturer since it was still under warranty. I think overheating had damaged the memory; we did not overclock the card. When I connected the GPU to a display, I could see several artifacts. The manufacturer sent us a new GPU after two weeks.

Apparently, if you’re using multiple GPUs, especially RTX 2080Tis, you should be using the blower-style coolers. See this.
