(Updated) NVIDIA RTX A6000 INCOMPATIBLE WITH PYTORCH

No, I wasn’t able to see any issues in the logs. I’ve also forwarded the logs to our cuDNN team to check, but it doesn’t seem that cuDNN itself is failing; it might be re-raising another error.
Could you disable cuDNN via torch.backends.cudnn.enabled = False and run the script for some time to see if it would raise another error, please?
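
A minimal sketch of how the toggle could be placed at the top of the script (the model, input shape, and loop below are placeholders, not the actual Test_forum.py):

import torch
import torch.nn as nn

# Disable cuDNN so PyTorch falls back to its native CUDA kernels;
# if the failure disappears, the original error most likely came from cuDNN.
torch.backends.cudnn.enabled = False

device = torch.device("cuda")

# Placeholder model and random data, just to exercise forward/backward on the GPU.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    x = torch.randn(8, 3, 32, 32, device=device)
    y = torch.randint(0, 10, (8,), device=device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()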

OK, I will try it and see if it raises another error.

This is the error I encountered when using torch.backends.cudnn.enabled = False:

Use GPU
Epoch: 0

Traceback (most recent call last):
  File "Test_forum.py", line 88, in <module>
    loss.backward()
  File "/home/hp/.conda/envs/Pytorch/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hp/.conda/envs/Pytorch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

And this error:

Use GPU
Traceback (most recent call last):
  File "/media/hp/46E0111EE0111631/jktong/1CNN-FIT-exp-size/250by250/Test_forum.py", line 73, in <module>
    net.to(device)
  File "/home/hp/.conda/envs/Pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/home/hp/.conda/envs/Pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/hp/.conda/envs/Pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/home/hp/.conda/envs/Pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered

I installed PyTorch 1.8.2+cu111 on Ubuntu 16.04 and used

sudo lightdm stop

to run the script. It doesn't seem to report any errors; I will run the program for several hours to see how it goes.
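
For reference, a minimal script along these lines (a sketch, not the actual Test_forum.py) exercises the same calls that failed above and can serve as a quick check that the setup is healthy before a long run:

import torch
import torch.nn as nn

device = torch.device("cuda")
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))

# Moving a module to the GPU previously raised an illegal memory access.
net = nn.Conv2d(3, 8, 3).to(device)

# A matmul goes through cuBLAS, which previously failed in cublasCreate.
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
print((a @ b).sum().item())

# A convolution forward/backward to exercise cuDNN as well.
x = torch.randn(4, 3, 64, 64, device=device, requires_grad=True)
net(x).sum().backward()
torch.cuda.synchronize()
print("sanity check passed")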

Thanks for the updates. Did you run the previous examples via CUDA_LAUNCH_BLOCKING=1 python script.py? If not, could you repeat it, please?
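
For reference, a small sketch of the two usual ways to enable blocking launches (the script name is just a placeholder):

# From the shell, so the variable is set before CUDA initializes:
#   CUDA_LAUNCH_BLOCKING=1 python script.py

# Or at the very top of the script, before any CUDA work happens:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch after setting the variable so it is visible when CUDA initializes

With blocking launches, each kernel error is reported at its actual call site instead of at a later, unrelated line, so the stack trace points at the real source of the failure.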

Well, I think I have figured out the reason. The problem was caused by a conflict between the NVIDIA driver and CUDA 11.1. Installing the NVIDIA driver with the --no-opengl-files flag and CUDA 11.1 from the runfile makes this A6000 GPU run smoothly on Ubuntu 16.04 LTS with the desktop GUI, and no error is reported.

Otherwise, I am not able to log in to the Ubuntu desktop after installing the NVIDIA driver.

I think this issue can be summed up as follows:

  1. Although this proves that the GPU can run on an older Ubuntu release such as 16.04 LTS, the problems that occurred on Windows 10 are still not resolved, but I hope this solution can provide some insight for the development of NVIDIA drivers.
  2. When I ran into the problem and turned to after-sales service for help, they just said we were the first customers to buy this GPU and that they knew nothing about deep learning or PyTorch. So I felt really bad, as the only place I could ask for help was the PyTorch forum. I think there is a lot of room for the NVIDIA after-sales service team to improve.