I’m working with a machine that has two RTX 2080 Tis. I can move my model and data to a GPU either by calling .to("cuda:1") or by using a context manager: with torch.cuda.device(1):
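To make the two placement styles concrete, here is a minimal sketch (the Linear model and random data are placeholders, not my actual training code; it only touches the GPU when two devices are actually present):

```python
import torch

# Placeholder model and batch, just to illustrate device placement.
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)

if torch.cuda.device_count() > 1:
    # Style 1: explicit device strings on every module/tensor.
    model = model.to("cuda:1")
    x = x.to("cuda:1")

    # Style 2: the context manager changes the *current* CUDA device,
    # which affects calls like .cuda() or torch.cuda allocations made
    # inside the block; CPU tensors still have to be moved explicitly.
    with torch.cuda.device(1):
        y = model(x)      # runs on cuda:1
        print(y.device)
```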
I’ve opted for the latter for the sake of clarity and simplicity. In both cases, though, my program crashes when training on the second GPU. If I reboot the machine and start training on the second GPU, it may run for about 5 minutes, but it inevitably starts producing NaN losses from the loss function, followed shortly by a crash.
After that, if I try to start training again, it throws one of the following errors:
cublas runtime error: resource allocation failed at ...THCGeneral.cpp:228
or a cuDNN error, or some CUDA illegal execution error.
I’ve tried both the PyTorch stable release and the nightly build. This is all on a freshly installed machine running Ubuntu 18.04, CUDA 10, and cuDNN 7.4.
Is this a driver problem, where PyTorch or CUDA doesn’t yet properly support the hardware? Or could it be a hardware problem (a malfunctioning or damaged card)? Any thoughts? I would attempt a Windows 10 install if that would help with diagnosis or avoid the problem.
Thanks!
edit:
I should also mention that if I run the same Python script with everything placed on the first GPU (cuda:0), everything runs fine without any problems.
edit 2:
Based on other threads on this forum, I tried running again with CUDA_LAUNCH_BLOCKING=1 prepended to my Python call.
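For reference, the invocation looks like this (TrainConstraints.py is the script named in the traceback below):

```shell
# Force synchronous kernel launches so the reported stack trace
# points at the op that actually failed, not a later sync point.
CUDA_LAUNCH_BLOCKING=1 python TrainConstraints.py
```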
This consistently results in the following error:
Traceback (most recent call last):
  File "TrainConstraints.py", line 98, in <module>
    loss.backward()
  File "/home/jeremy/anaconda3/envs/mika_test/lib/python3.7/site-packages/torch/tensor.py", line 106, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/jeremy/anaconda3/envs/mika_test/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED