Getting CUDA version 9.0.176, but nvcc shows the CUDA version to be 10.0.130

Hi guys,

I’m having trouble with the CUDA version for PyTorch. I recently changed my working environment from a Titan Xp to an RTX 2080 Ti, and then the same code ran into a problem like this:

File "train.py", line 183, in train
    outputs = model(inputs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/lane-detection/lib/network_zoo/DANet.py", line 112, in forward
    x = self.head(c4)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/lane-detection/lib/network_zoo/DANet.py", line 153, in forward
    sa_feat = self.sa(feat1)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/lane-detection/lib/network_zoo/DANet.py", line 46, in forward
    energy = torch.bmm(proj_query, proj_key)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:450
^CException ignored in: <module 'threading' from '/usr/lib/python3.5/threading.py'>

I’m using 8 GPUs with the DataParallel module. After looking into this problem for a while, I figured out that the RTX 2080 Ti requires CUDA 10.

However, the thing is that I did install CUDA 10, and I verified this with nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

But in an IPython kernel, torch.version.cuda shows:

In [2]: torch.version.cuda
Out[2]: '9.0.176'
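For context, this kind of mismatch can be caught early with a small startup check that parses the version string and compares major releases. This is just a sketch with hypothetical helper names; the '9.0.176' and '10.0.130' strings simply mirror the outputs above:

```python
def cuda_major(version_string):
    """Extract the major release from a CUDA version string like '9.0.176'."""
    return int(version_string.split(".")[0])

def check_cuda(reported, required_major):
    """Raise if the reported CUDA toolkit is older than the GPU requires."""
    if cuda_major(reported) < required_major:
        raise RuntimeError(
            "built with CUDA %s, but this GPU needs CUDA %d+"
            % (reported, required_major)
        )

check_cuda("10.0.130", required_major=10)      # fine, no exception
try:
    check_cuda("9.0.176", required_major=10)   # the situation in this thread
except RuntimeError as e:
    print(e)
```

In the real setting, the first argument would be torch.version.cuda, so the check fails at import time instead of deep inside a cuBLAS call.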

What can I do about it?

Thanks in advance.

Did you install PyTorch from source or did you use the binaries?
In the latter case, CUDA (and other libs) is shipped inside the binaries, so you don’t have to install a local CUDA version (your local one will be ignored).
Have a look at the install instructions and select the command for CUDA10.
I would recommend the conda install command, as the pip install currently might downgrade to CUDA9 even if CUDA10 was selected as tracked here.
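For reference, the CUDA10 conda command from the install selector looked roughly like this at the time; the exact package pins may differ, so check the selector on pytorch.org rather than copying this verbatim:

```shell
# CUDA 10.0 build from the pytorch channel (versions per the selector at pytorch.org)
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
```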


I installed PyTorch with pip, but I’m not sure I can use conda in my environment, for various reasons. Is there any chance I could switch the CUDA version to 10 without using conda?

FYI, I found that this problem occurs only if the batch size on each GPU is greater than 1. With exactly one image per GPU, the error does not occur.
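For what it’s worth, the failing call, torch.bmm, is a batched matrix multiply, so with one image per GPU each worker only pushes a batch of size 1 through cuBLAS. A pure-Python sketch of the shape contract (B, N, M) x (B, M, P) -> (B, N, P), just to illustrate the semantics, not the failing kernel:

```python
def bmm(a, b):
    """Batched matrix multiply on nested lists: (B, N, M) x (B, M, P) -> (B, N, P)."""
    assert len(a) == len(b), "batch sizes must match"
    out = []
    for A, B in zip(a, b):
        rows, inner, cols = len(A), len(A[0]), len(B[0])
        assert inner == len(B), "inner dimensions must match"
        out.append([[sum(A[i][k] * B[k][j] for k in range(inner))
                     for j in range(cols)] for i in range(rows)])
    return out

# one batch element of 2x2 matrices
print(bmm([[[1, 2], [3, 4]]], [[[5, 6], [7, 8]]]))  # [[[19, 22], [43, 50]]]
```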

These commands should install the stable version with CUDA10:

pip3 install https://download.pytorch.org/whl/cu100/torch-1.1.0-cp37-cp37m-linux_x86_64.whl
pip3 install https://download.pytorch.org/whl/cu100/torchvision-0.3.0-cp37-cp37m-linux_x86_64.whl

CUDA9 works in some cases with Turing GPUs, but it is generally not recommended, so we should make sure you can properly install CUDA10.
Let me know if these commands work for you.

OK, I’ll give it a try. Thanks a lot for your kindness.

In case it helps anyone: I had the same issue, and installing PyTorch via the install instructions (conda install) worked for me.