Multiple GPUs: NaN and Then CUDA Crashes [CLOSED]

I’m working on a machine that has two RTX 2080 Tis. I can move my model and data over to the second GPU either with .to("cuda:1") or with a context manager: with torch.cuda.device(1):
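Roughly, the two placements look like this; the model and data below are just placeholders to illustrate the pattern, not my actual training code:

```python
import torch
import torch.nn as nn

# Placeholder model and batch, just to illustrate device placement
model = nn.Linear(16, 4)
data = torch.randn(8, 16)

# Approach 1: explicit device string
model = model.to("cuda:1")
data = data.to("cuda:1")

# Approach 2: the context manager sets the current device,
# so plain .cuda() calls (with no index) land on GPU 1
with torch.cuda.device(1):
    model = nn.Linear(16, 4).cuda()
    data = torch.randn(8, 16).cuda()
    out = model(data)
```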

I’ve opted for the latter (the context manager) just for the sake of clarity and simplicity. In both cases, though, my program crashes when training on the second GPU. If I reboot the machine and start training on the second GPU again, it may run for about 5 minutes, but it inevitably starts producing NaN values from the loss function, followed shortly by a crash.

After that, if I try to start the training again, it throws one of the following errors:

cublas runtime error: resource allocation failed at ...THCGeneral.cpp:228

or a cuDNN error

or some CUDA illegal execution error.

I’ve tried both the stable and nightly builds of PyTorch. This is all on a freshly installed machine running Ubuntu 18.04, CUDA 10, and cuDNN 7.4.
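For reference, this is the kind of quick check I can run to confirm what PyTorch itself reports for the toolkit versions and the two cards (standard torch calls, nothing specific to my script):

```python
import torch

# Versions PyTorch was built against / is loading
print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# Enumerate the visible GPUs
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
```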

Is this a driver problem, where PyTorch or CUDA doesn’t fully support the hardware yet? Or could it be a hardware problem (a malfunctioning or damaged card)?

Any thoughts? I’d be willing to try a Windows 10 install if that would help with diagnosis or avoid the problem.

Thanks!

edit:

I should also mention that if I run the same Python script with everything placed on the first GPU (cuda:0), it runs fine and doesn’t hit any of these problems.

edit 2:

Based on other threads on this forum, I tried running again with CUDA_LAUNCH_BLOCKING=1 prepended to my python call.
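That is, launching the script roughly like this (using the script name from the traceback below):

```
CUDA_LAUNCH_BLOCKING=1 python TrainConstraints.py
```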

This consistently results in the following error:

Traceback (most recent call last):
  File "TrainConstraints.py", line 98, in <module>
    loss.backward()
  File "/home/jeremy/anaconda3/envs/mika_test/lib/python3.7/site-packages/torch/tensor.py", line 106, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/jeremy/anaconda3/envs/mika_test/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Sadly, this does appear to be a hardware failure. I’ve traced it as far as I can, and the problem isn’t reproducible on a different machine. In the course of testing various drivers and clean installs to mitigate the issue, I’ve also begun seeing visual display artifacts that can only be explained by a GPU on its deathbed.

Consider this closed.