I have two GPUs in my machine, an RTX 2080Ti and a 1080Ti. Last night, during an experiment on the RTX 2080Ti, the computation failed (the way it does when you hit a NaN), and I've been having weird issues with the RTX 2080Ti ever since.
My syslog is filled with:
[ 1653.030199] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 4, SM 1): Illegal Instruction Encoding
[ 1653.030203] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 5, TPC 4, SM 1): Multiple Warp Errors
The same model with the same input yields different outputs depending on the GPU! Even for a simple convolution, the outputs can differ by as much as 1e37. It seems the issue must lie in the CUDA driver or the layers below it.
I'm currently running driver version 410.78 with CUDA 10.0 and PyTorch 1.0, tested under both Python 2.7 and 3.7.
import torch
from torch.nn import Conv2d

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.enabled = True

op = Conv2d(3, 128, kernel_size=(9, 9), stride=(1, 1), padding=(4, 4))
# op.weight.norm() = 6.5080
# op.bias.norm() = 0.4507
# x is the input batch; x.norm() = 326.2348

x_cpu = x.to('cpu')
y_cpu = op.to('cpu')(x_cpu)

x_cuda0 = x.to('cuda:0')
y_cuda0 = op.to('cuda:0')(x_cuda0).cpu()

x_cuda1 = x.to('cuda:1')
y_cuda1 = op.to('cuda:1')(x_cuda1).cpu()
Has anyone ever encountered a similar issue? Any ideas on what to do next?!
I just searched for your error message and came across some posts on other boards; these points came up repeatedly:
- reduce the clock in case you’ve overclocked your system
- your PSU might not have enough power for the card
- install the latest drivers
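Before any of that, it can help to confirm which driver-facing versions PyTorch itself sees, since a mismatch between the driver and the CUDA/cuDNN libraries can produce exactly this kind of Xid error. A minimal check:

```python
import torch

# Report the toolkit/library versions visible to PyTorch; compare these
# against the installed driver's supported CUDA version (nvidia-smi).
print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("cuDNN:", torch.backends.cudnn.version())
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
```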
I don't think this error is PyTorch-related. Sorry I can't give more input, but I think you might want to post this issue on an NVIDIA support board.
Thank you for looking into this. I did search a lot before posting here. The closest thing I found was that some of the early RTX 2080Ti units were defective, but in almost all of those cases the errors were triggered by the card heating up. It is definitely not PyTorch; most likely the driver or the hardware below it. I figured that if anywhere, this community is the most likely to see these issues, now or in the future.
In my situation, these errors don't happen in the midst of an experiment while the card is operating at full capacity. It doesn't heat up; I can't even get past the first training iteration on the RTX 2080Ti anymore. The PSU is 1000W, which I believe is more than enough. There is no manual overclocking either, though I'm not sure whether the manufacturer applied a factory overclock.
I'll update the driver to the 415 beta to see if that fixes the issue, and I'll post here if I have any luck.
Hi, yeah, we ran into the same issue a few days ago: one of our two RTX 2080Tis was blowing up the activations and the gradients for no apparent reason, resulting in NaNs all over the place, while the other card worked just fine. Here is the thread we opened: https://discuss.pytorch.org/t/different-losses-on-2-different-machines/.
We ended up sending the GPU back to the manufacturer since it was still under warranty. I think overheating had damaged the memory; we did not overclock the card. When I connected the GPU to a display, I could see several artifacts. The manufacturer sent us a new GPU after two weeks.
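For anyone trying to localize this kind of blow-up, a quick way to catch non-finite activations the moment they appear is a forward hook on each module. This is a generic sketch (the tiny model here is just a placeholder, not anyone's actual network):

```python
import torch
from torch import nn

def check_finite(module, inputs, output):
    # Forward hook: raise as soon as a layer emits NaN/Inf,
    # naming the offending module type.
    if not torch.isfinite(output).all():
        raise RuntimeError(f"non-finite output in {module.__class__.__name__}")

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
for m in model.modules():
    if not isinstance(m, nn.Sequential):
        m.register_forward_hook(check_finite)

x = torch.randn(1, 3, 16, 16)
y = model(x)  # passes silently when everything is finite
```

For gradients, `torch.autograd.set_detect_anomaly(True)` does a similar job on the backward pass, at some runtime cost.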
Apparently, if you're running multiple GPUs, especially RTX 2080Tis, you should be using the blower-style fan models. See this.