[NEED HELP] Trouble with CUDA capability sm_86

Ilias_Giannakopoulos · May 4, 2021, 3:35pm

Hi all,

I am trying to train a network on my NVIDIA RTX 3070. I receive the following error:

NVIDIA GeForce RTX 3070 with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3070 GPU with PyTorch.

Any help will be appreciated!

eqy · May 4, 2021, 5:07pm

It looks like your version of PyTorch is too old for compute capability 8.6 (Ampere). Are you able to update your installation or rebuild PyTorch?

Ilias_Giannakopoulos · May 4, 2021, 5:43pm

Hi eqy,

I am trying with the latest stable version of PyTorch that works with CUDA 11.1 (also tried with 10.2, but didn’t have any luck).
Could it be that NVIDIA 3070 works with CUDA 11.2 and higher, while PyTorch supports up to CUDA 11.1 for the moment?

Thanks in advance!

eqy · May 4, 2021, 6:11pm

That seems strange, as Ampere support should be available and has been for a while now, IIRC. If you can build the latest version of PyTorch from source, you can set TORCH_CUDA_ARCH_LIST="8.6" in your environment to force it to build with SM 8.6 support.

Ilias_Giannakopoulos · May 4, 2021, 7:34pm

Hi eqy,

Thanks a ton! I specified `TORCH_CUDA_ARCH_LIST` in bash. After that I redownloaded PyTorch and the network seems to be training properly.

Best,

Ilias

ptrblck · May 5, 2021, 1:05am

NIT: TORCH_CUDA_ARCH_LIST is only used while building from source and won’t change the compute capabilities shipped in the binaries, which you can get via print(torch.cuda.get_arch_list()).

Regarding CUDA 10.2 not working: that’s expected, since Ampere GPUs need CUDA>=11.0.

As for whether the 3070 needs CUDA 11.2 or higher: no, the 3070 uses sm_86, which is natively supported in CUDA>=11.1 and is binary compatible with sm_80, so it would already work with CUDA 11.0.
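To make the compatibility rule concrete, here is a minimal sketch (not PyTorch's actual check). The helper names `parse_sm` and `covers_device` are hypothetical, and the rule it encodes is an assumption: a binary built for sm_XY runs on devices with the same major version X and a minor version of at least Y.

```python
import re
from typing import Iterable, Tuple


def parse_sm(arch: str) -> Tuple[int, int]:
    """Turn an architecture string such as 'sm_86' into (8, 6)."""
    digits = re.match(r"sm_(\d+)", arch).group(1)
    return int(digits[:-1]), int(digits[-1])


def covers_device(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if any compiled sm_XY entry can run on the device.

    Assumption: same major version and compiled minor <= device minor.
    Entries that do not start with 'sm_' (for example PTX targets) are ignored.
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue
        major, minor = parse_sm(arch)
        if major == dev_major and minor <= dev_minor:
            return True
    return False


# Arch list from the original error message, checked against an 8.6 device.
print(covers_device((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70"]))
# A build that ships sm_80, checked against the same device.
print(covers_device((8, 6), ["sm_80"]))
```

With PyTorch installed, the two inputs would come from torch.cuda.get_device_capability(0) and torch.cuda.get_arch_list().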

In any case, good to hear it’s working now

Ilias_Giannakopoulos · May 5, 2021, 2:26am

Hi ptrblck,

Thanks a lot for the clarifications. So the mistake was that I hadn’t specified TORCH_CUDA_ARCH_LIST on the PATH?

ptrblck · May 5, 2021, 5:18am

No, I don’t think so, as it doesn’t change the behavior of the binaries and is only used if you build PyTorch from source, which isn’t the case if I understand your workflow correctly.
The original error message is raised if you’ve installed a pip wheel or conda binary that doesn’t support your architecture. Based on the supported capabilities listed in the message (sm_37 sm_50 sm_60 sm_70), it seems you were using a pip wheel with CUDA<=10.2.
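The supported architectures can be read straight from the warning text. A rough sketch, assuming the message keeps the wording quoted at the top of this thread; `parse_capability_warning` is a hypothetical helper, not part of PyTorch:

```python
import re
from typing import List, Optional, Tuple


def parse_capability_warning(message: str) -> Tuple[Optional[str], List[str]]:
    """Extract the device capability and the supported list from the warning."""
    device = re.search(r"with CUDA capability (sm_\d+)", message)
    supported = re.search(r"supports CUDA capabilities ((?:sm_\d+\s*)+)", message)
    supported_list = supported.group(1).split() if supported else []
    return (device.group(1) if device else None), supported_list


warning = (
    "NVIDIA GeForce RTX 3070 with CUDA capability sm_86 is not compatible "
    "with the current PyTorch installation. The current PyTorch install "
    "supports CUDA capabilities sm_37 sm_50 sm_60 sm_70."
)
device_arch, supported_archs = parse_capability_warning(warning)
print(device_arch, supported_archs)
```

If no entry in the supported list has major version 8, the installed binary was not built for Ampere.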

Ilias_Giannakopoulos · May 5, 2021, 2:02pm

Indeed I did exactly that: pip with CUDA 10.2. I think I messed up the installation when I tried to download PyTorch with 11.1 later.

Thanks again for the help and the clarifications!

rojas70 · August 10, 2021, 10:23am

@ptrblck I am having the same problem as @Ilias_Giannakopoulos. I began by installing pytorch as indicated on their website:

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

How might I verify if the pip wheel used CUDA<=10.2? I wanted to learn how to check for this.

In fact, my problem is more nuanced. I can get the code to work when I run it on the python/ipython interpreters, but any time I try to debug with code or pycharm, I get the error message.

I have been on this for two weeks, trying and re-trying different combinations of pytorch versions (and nvidia drivers/cuda toolkit/libcudnn). I have checked many times the virtual environments I use and how I select them in code. I have tried everything I know except building from source, and have not been able to resolve this discrepancy on my system.

Environment Information: my setup tries to follow the Nvidia compatibility matrix: driver-470/toolkit 114/libcudnn8.2/pytorch1.9+cu111

PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-25-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.100
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.0
[pip3] torch==1.9.0+cu111 <---- Is this a problem? No cuda 114?
[pip3] torchaudio==0.9.0
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect

And sys.path looks good. Is there anything specific to IDEs that I may be missing on this?

ptrblck · August 10, 2021, 4:58pm

Based on your cross-post I would also assume that your pycharm is using another env with a different PyTorch installation.
I would thus either create a new virtual env and reinstall PyTorch + pycharm there or make sure to uninstall all PyTorch installations in the current and base environment and reinstall it in the current env only.
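One way to confirm which environment an IDE is really using is to run the same small script from the terminal and from the IDE's run/debug configuration and compare the output. This is only a sketch using the standard library; `describe_environment` is a hypothetical helper, and the virtualenv test assumes an environment created with `python3 -m venv`.

```python
import importlib.util
import sys
from typing import Any, Dict


def describe_environment(module_name: str = "torch") -> Dict[str, Any]:
    """Report which interpreter is running and where a module would load from."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # For venv-style environments the two prefixes differ.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # None means the module is not importable from this interpreter.
        "module_origin": spec.origin if spec is not None else None,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the executable or module_origin lines differ between the two runs, the IDE is picking up a different interpreter or a different PyTorch installation.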

rojas70 · August 11, 2021, 6:26am

Thank you.

I discovered my virtual environment had problems. When I tried to install packages to it, they would be installed globally not locally. There was corruption throughout… I deleted it and started from scratch.

I came up with the following steps as a guide for anyone who would like to have a type of cheatsheet to verify their installation:

Nvidia Driver and CUDA Toolkit

If already installed, examine your Nvidia GPU driver version

nvidia-smi

or

cat /proc/driver/nvidia/version

Learn its architecture

sudo lshw -C display

Learn your current Linux kernel

uname -a

Look up the Nvidia Compatibility Matrix to determine the correct driver, toolkit, and libcudnn

Support Matrix - NVIDIA Docs

Support Matrix - NVIDIA Docs (gcc, glibc)

Install Driver

sudo apt install nvidia-driver-XXX

Install CUDA Toolkit

https://developer.nvidia.com/cuda-downloads

Install libcudnnX (useful to do deep learning with cuda)

Installation Guide - NVIDIA Docs

sudo apt install libcudnnX

Install pytorch

We will wait for this until you set up your virtualenv below.

Testing your system’s python setup

First, note the location of the system-wide python interpreter

which python3

Note the location of the system-wide pip

which pip3

What packages are there globally (this command will also list packages that were installed via apt-get install)

python3 -m pip list (or alternatively python3 -m pip freeze)

Create virtualenv if not yet created

python3 -m venv name_for_your_env

Usually, you will be asked to install the required packages, normally listed in the file “requirements.txt”. Examine it and become familiar with it. From within your virtual environment, install them via:

python3 -m pip install -r requirements.txt

If not already installed, install pytorch.

You can get the pip3/conda command from here. Most people recommend conda/docker installs. We are doing pip3 to have more flexibility with the packages we need with different repos.

Go to https://pytorch.org/ (choose your config). You may get a command like:

Note that if a package is properly installed, it should appear in your virtual_env/lib/pythonX.X/site-packages folder.

Additionally, ensure your pythonpath is properly set (learn more about pythonpath/imports/sys.path here: The Definitive Guide to Python import Statements | Chris Yeh)

pythonpath is an environment variable that contains paths to load python modules/scripts that are not binaries (i.e. located.

The pythonpath env variable is set in the .bashrc file found in your user folder (the user folder is located at ~/ and the “.” means it is a hidden file). Use your favorite editor to open it:

emacs ~/.bashrc

Look to see if you already set any pythonpath’s.

export PYTHONPATH=$PYTHONPATH:/new/path1/goes/here:/new/path2/goes/here:
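The same check can be done from inside Python, which shows what the running interpreter actually inherited rather than what .bashrc says. A small sketch; `pythonpath_entries` is a hypothetical helper name:

```python
import os
from typing import List, Optional


def pythonpath_entries(value: Optional[str] = None) -> List[str]:
    """Split a PYTHONPATH-style string into its non-empty entries.

    With no argument, the current PYTHONPATH environment variable is used.
    """
    if value is None:
        value = os.environ.get("PYTHONPATH", "")
    return [entry for entry in value.split(os.pathsep) if entry]


print(pythonpath_entries())
```

An empty list means no extra paths were set for this process, which is usually what you want inside a clean virtualenv.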

Sanity Checks for torch/gpu

In your virtualenv, open a python interpreter:

python3 (or even better ipython3; you will need to install it first: pip3 install ipython).

Check the system path from which modules are loaded

import sys

sys.path (should not see undesired paths here).

Import torch

import torch

Double check that this torch module is located inside your virtual environment

import importlib.util

importlib.util.find_spec('torch').origin → should return a path in your virtualenv (the older imp module is deprecated and was removed in Python 3.12)

Check the version of your torch module and cuda

torch.__version__

torch.version.cuda

Check the supported architectures

torch.cuda.get_arch_list()

Check for the number of gpu detected

torch.cuda.device_count()

Can you read the device?

device = torch.device('cuda:0') # 0 by default; if you have more GPUs, increase the index.
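For convenience, the sanity checks can be gathered into one script and run with the virtualenv's interpreter. This is a sketch, not an official diagnostic: `torch_sanity_report` is a hypothetical name, and it assumes a PyTorch version recent enough to provide torch.cuda.get_arch_list().

```python
from typing import Any, Dict, Optional


def torch_sanity_report() -> Optional[Dict[str, Any]]:
    """Gather the torch/GPU checks in one place; returns None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None

    report: Dict[str, Any] = {
        "torch_file": torch.__file__,              # should live inside the virtualenv
        "torch_version": torch.__version__,
        "cuda_runtime": torch.version.cuda,        # CUDA runtime the build ships with
        "arch_list": torch.cuda.get_arch_list(),   # capabilities compiled into the build
        "device_count": torch.cuda.device_count(),
        "devices": [],
    }
    for index in range(report["device_count"]):
        name = torch.cuda.get_device_name(index)
        capability = torch.cuda.get_device_capability(index)
        report["devices"].append((index, name, capability))
    return report


report = torch_sanity_report()
if report is None:
    print("torch is not importable from this interpreter")
else:
    for key, value in report.items():
        print(f"{key}: {value}")
```

Running it once from the terminal and once from the IDE makes any mismatch in torch_file or arch_list easy to spot.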

Edwardius · October 13, 2021, 7:14am

Hi,

Having the exact same issue.

NVIDIA GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.

How would you check the CUDA version within the pip wheel? Currently I have torch 1.9.1 and torchvision 0.10.1 in the virtual environment. When doing ‘nvidia-smi’ I get a cuda version of 11.4. Any pointers?

ptrblck · October 13, 2021, 7:28am

torch.cuda.get_arch_list() will return the compute capabilities used in your current PyTorch build and torch.version.cuda will return the used CUDA runtime.
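The version string itself can also hint at the CUDA runtime, since wheels such as 1.9.0+cu111 in this thread carry a "+cuXYZ" suffix. A minimal sketch, assuming the last digit of the suffix is the minor version; `cuda_from_wheel_version` is a hypothetical helper and torch.version.cuda remains the authoritative answer:

```python
import re
from typing import Optional


def cuda_from_wheel_version(version: str) -> Optional[str]:
    """Map a local version suffix such as '+cu111' to '11.1'.

    Returns None when there is no '+cuXYZ' suffix, in which case the
    string alone does not reveal the CUDA runtime.
    """
    match = re.search(r"\+cu(\d+)$", version)
    if match is None:
        return None
    digits = match.group(1)
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_from_wheel_version("1.9.0+cu111"))
print(cuda_from_wheel_version("1.9.1"))
```

A version without the suffix, like the 1.9.1 mentioned above, gives no answer here, so check torch.version.cuda directly in that case.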

Edwardius · October 13, 2021, 11:42pm

10.2 is the CUDA runtime currently being used. How would I fix this? Is there a way to specify the CUDA version used in the pip wheel?

ptrblck · October 14, 2021, 12:04am

Yes, you can select the CUDA version in this UI.

Edwardius · October 14, 2021, 12:45am

And it doesn’t matter that my root contains cuda 11.4 because I’ll be downloading this version into the venv?

ptrblck · October 14, 2021, 5:55am

It depends a bit on your use case. The pip wheels and conda binaries ship with their own CUDA runtime and you will be able to run PyTorch code without using the local CUDA toolkit (as long as the right CUDA runtime is selected; e.g. for Ampere GPUs you have to use CUDA>=11).
However, if you want to build a custom CUDA extension, the local CUDA toolkit will be used and you should install a matching version to the runtime used in PyTorch.
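For the custom-extension case, a quick comparison of the local toolkit against the runtime reported by torch.version.cuda could look like the sketch below. Treating "matching" as the same major.minor pair is an assumption made for illustration, the helper names are hypothetical, and the nvcc line is a sample rather than output captured from a real system.

```python
import re
from typing import Optional, Tuple


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Pull (major, minor) out of `nvcc --version` text, e.g. 'release 11.4'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_matches_runtime(nvcc_output: str, torch_cuda: str) -> bool:
    """Compare the local toolkit release with a torch.version.cuda string."""
    toolkit = parse_nvcc_release(nvcc_output)
    major, minor = torch_cuda.split(".")[:2]
    return toolkit == (int(major), int(minor))


# Sample line in the format printed by `nvcc --version`.
sample_nvcc = "Cuda compilation tools, release 11.4, V11.4.100"
print(toolkit_matches_runtime(sample_nvcc, "11.1"))
print(toolkit_matches_runtime(sample_nvcc, "11.4"))
```

On a real machine the text would come from subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout, provided nvcc is on the PATH.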

Nakkhatra · October 15, 2021, 6:51pm

Hi @Ilias_Giannakopoulos, I am facing the same issue with my RTX 3070. I haven’t installed torch globally; I installed torch only inside a conda environment, and I’m using Ubuntu as my OS. How can I add TORCH_CUDA_ARCH_LIST? Is it possible to add this command while installing with conda?

Nakkhatra · October 15, 2021, 7:09pm

@Ilias_Giannakopoulos I tried
TORCH_CUDA_ARCH_LIST="8.6" conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch

but it didn’t work.