NVIDIA RTX A6000 GPU incompatible with PyTorch

tjk · October 30, 2021, 2:40am

Hello guys,
I am trying to run a machine learning program with PyTorch on NVIDIA RTX A6000 graphics cards, but I ran into the following problem:
Traceback (most recent call last):

  File "D:\jktong\LCNIl-FIT-exp-size\250by250\Train_directions_complex_gpu_large_u7.3.py", line 223, in <module>
    loss.backward()
  File "C:\Users\hp\anaconda3\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\hp\anaconda3\lib\site-packages\torch\autograd\__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
The same program runs without problems on three NVIDIA RTX 2080 Ti graphics cards, so the program itself does not seem to be the cause.

Our workstation has CUDA 11.1, cuDNN 11.3, and PyTorch 1.8.2.
I have searched for a PyTorch version recommended for this graphics card, but the card seems to be too new. Could you please suggest how to solve the problem, for example by recommending a suitable PyTorch version for this GPU? Thank you!

ptrblck · October 30, 2021, 4:56am

I guess you might be using the PyTorch binaries with the CUDA 10.2 runtime, while you would need CUDA>=11.0. Check the shipped CUDA version via print(torch.version.cuda) and make sure it’s 11.

Your local CUDA toolkit won’t be used if you’ve installed the binaries, as they ship with their own CUDA runtime, cuDNN, NCCL, etc. It is used only if you are building PyTorch from source or building a custom CUDA extension. Also, cuDNN is currently at version 8, so you might have read the wrong version tag.
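The check described above can be extended to print everything the installed binaries ship with in one go. The following is a minimal sketch, not an official diagnostic: the helper names parse_cuda_version, runtime_is_new_enough, and report are made up for illustration, and the (11, 0) minimum is taken from the CUDA>=11.0 requirement mentioned above. torch is imported inside report() so that the two version helpers also work on a machine without PyTorch.

```python
def parse_cuda_version(version):
    """Turn a string such as '11.1' into (11, 1). None means a CPU-only build."""
    if version is None:
        return None
    return tuple(int(part) for part in version.split(".")[:2])


def runtime_is_new_enough(version, minimum=(11, 0)):
    """True if the CUDA runtime shipped with the binaries is at least `minimum`."""
    parsed = parse_cuda_version(version)
    return parsed is not None and parsed >= minimum


def report():
    """Print what the installed PyTorch binaries ship with (hypothetical helper)."""
    import torch  # imported lazily; assumes PyTorch is installed

    print("CUDA runtime shipped with PyTorch:", torch.version.cuda)
    print("cuDNN shipped with PyTorch:", torch.backends.cudnn.version())
    print("runtime >= 11.0:", runtime_is_new_enough(torch.version.cuda))
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("compute capability:", torch.cuda.get_device_capability(0))
        print("architectures in this build:", torch.cuda.get_arch_list())
    else:
        print("no CUDA device visible to PyTorch")
```

Calling report() in the same Python environment that runs the training script shows the versions PyTorch actually uses, which can differ from the locally installed toolkit.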

tjk · November 4, 2021, 1:18pm

Hello, we uninstalled CUDA 10.0 from our workstation before installing CUDA 11.0. Could the situation you describe still occur on this computer?

tjk · November 4, 2021, 1:51pm

We executed print(torch.version.cuda) and it printed 11.1.
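With the shipped runtime reporting 11.1, one possible next step is to check whether cuDNN initializes at all outside the training script. The snippet below is only a sketch under assumptions: cudnn_smoke_test is a hypothetical helper, and the layer and tensor sizes are arbitrary. It runs one small convolution forward and backward on the GPU, which is the kind of call that would go through cuDNN.

```python
import importlib.util


def cudnn_smoke_test():
    """Run a tiny conv forward/backward on the GPU (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    if not torch.cuda.is_available():
        return "no CUDA device visible"

    # Arbitrary small shapes; only meant to exercise the convolution path.
    conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
    x = torch.randn(4, 3, 32, 32, device="cuda", requires_grad=True)
    loss = conv(x).sum()
    loss.backward()
    # Wait for the GPU so that any asynchronous error surfaces here.
    torch.cuda.synchronize()
    return "ok"


print(cudnn_smoke_test())
```

If this small test already raises a cuDNN or CUDA error, the installation or driver setup is the more likely suspect than the training program.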

tjk · November 4, 2021, 2:06pm

If I try to print the loss object in Spyder, I get the following error:
loss
Traceback (most recent call last):

  File "C:\Users\hp\anaconda3\lib\site-packages\IPython\core\formatters.py", line 702, in __call__
    printer.pretty(obj)

  File "C:\Users\hp\anaconda3\lib\site-packages\IPython\lib\pretty.py", line 394, in pretty
    return _repr_pprint(obj, self, cycle)

  File "C:\Users\hp\anaconda3\lib\site-packages\IPython\lib\pretty.py", line 700, in _repr_pprint
    output = repr(obj)

  File "C:\Users\hp\anaconda3\lib\site-packages\torch\tensor.py", line 193, in __repr__
    return torch._tensor_str._str(self)

  File "C:\Users\hp\anaconda3\lib\site-packages\torch\_tensor_str.py", line 383, in _str
    return _str_intern(self)

  File "C:\Users\hp\anaconda3\lib\site-packages\torch\_tensor_str.py", line 358, in _str_intern
    tensor_str = _tensor_str(self, indent)

  File "C:\Users\hp\anaconda3\lib\site-packages\torch\_tensor_str.py", line 242, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)

  File "C:\Users\hp\anaconda3\lib\site-packages\torch\_tensor_str.py", line 90, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))

RuntimeError: CUDA error: an illegal memory access was encountered
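
CUDA kernels are launched asynchronously, so an error such as the illegal memory access above can be reported at a later call (here, printing the tensor) rather than at the operation that caused it. Setting the CUDA_LAUNCH_BLOCKING environment variable to 1 makes launches synchronous, which usually gives a stack trace that points closer to the failing operation. A minimal sketch, assuming it is placed at the very top of the training script; enable_blocking_launches is a hypothetical helper name:

```python
import os


def enable_blocking_launches():
    """Request synchronous CUDA kernel launches (debugging only; slows execution)."""
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


# Must run before PyTorch initializes CUDA, so call it before the first GPU use.
enable_blocking_launches()
# import torch  # assumption: the rest of the training script follows from here
```

In an IDE such as Spyder the console may already have initialized CUDA, so restarting the console before running the script with this setting is the safer option.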