[NEED HELP] Trouble with CUDA capability sm_86

It looks like your version of PyTorch is too old for compute capability 8.6 (Ampere). Are you able to update your installation or rebuild PyTorch?

Hi eqy,

I am trying the latest stable version of PyTorch that works with CUDA 11.1 (I also tried 10.2, but didn’t have any luck).
Could it be that the NVIDIA 3070 works with CUDA 11.2 and higher, while PyTorch only supports up to CUDA 11.1 for the moment?

Thanks in advance!

That seems strange, as there has been Ampere support for a while now, IIRC. If you can build the latest version of PyTorch from source, you can specify TORCH_CUDA_ARCH_LIST="8.6" in your environment to force it to build with SM 8.6 support.
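For example, a source build with sm_86 forced could look like this (just a sketch, assuming a checkout of the PyTorch repo and following its build instructions):

TORCH_CUDA_ARCH_LIST="8.6" python setup.py install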

Hi eqy,

Thanks a ton! I specified `TORCH_CUDA_ARCH_LIST` in bash. After that I reinstalled PyTorch and the network seems to be training properly.

Best,

Ilias


NIT: TORCH_CUDA_ARCH_LIST will be used while building from source and won’t change the shipped compute capabilities in the binaries, which you can get via print(torch.cuda.get_arch_list()).
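For example, an older CUDA 10.2 wheel would report something like this (output is illustrative):

import torch
print(torch.cuda.get_arch_list())  # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70']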

That’s expected, since Ampere GPUs need CUDA>=11.0

No, the 3070 uses sm_86, which is natively supported in CUDA>=11.1 and binary compatible with sm_80, so it would already work with CUDA 11.0.
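You can verify this on your own machine with a quick check (device index 0 assumed):

import torch
major, minor = torch.cuda.get_device_capability(0)  # (8, 6) for a 3070
print(f"sm_{major}{minor}")
print(torch.cuda.get_arch_list())  # should contain sm_86 or the binary-compatible sm_80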

In any case, good to hear it’s working now :slight_smile:


Hi ptrblck,

Thanks a lot for the clarifications. So the mistake was that I hadn’t set TORCH_CUDA_ARCH_LIST in my environment?

No, I don’t think so, as it doesn’t change the behavior of the binaries and is only used if you build PyTorch from source, which isn’t the case if I understand your workflow correctly.
The original error message is raised if you’ve installed a pip wheel or conda binary that doesn’t support your architecture. Based on the message:

it seems you were using a pip wheel with CUDA<=10.2.
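You can read both pieces of information directly from the wheel (the outputs below are what a CUDA 10.2 build would show):

import torch
print(torch.__version__)   # e.g. '1.9.0+cu102'
print(torch.version.cuda)  # e.g. '10.2'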

Indeed, I did exactly that: pip with CUDA 10.2. I think I messed up the installation when I later tried to install PyTorch with 11.1.

Thanks again for the help and the clarifications!

@ptrblck I am having the same problem as @Ilias_Giannakopoulos , I began by installing pytorch as indicated on their website:

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

How might I verify if the pip wheel used CUDA<=10.2? I wanted to learn how to check for this.

In fact, my problem is more nuanced. I can get the code to work when I run it in the python/ipython interpreters, but any time I try to debug with VS Code or PyCharm, I get the error message.

I have been at this for two weeks, trying and re-trying different combinations of PyTorch versions (and NVIDIA drivers / CUDA toolkit / libcudnn). I have checked the virtual environments I use many times, as well as how I select them in the IDEs. I have tried everything I know except building from source, and have not been able to resolve this discrepancy on my system.

Environment information: my setup tries to follow the NVIDIA compatibility matrix: driver 470 / toolkit 11.4 / libcudnn 8.2 / PyTorch 1.9+cu111

PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-25-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.4.100
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.0
[pip3] torch==1.9.0+cu111 <---- Is this a problem? No CUDA 11.4?
[pip3] torchaudio==0.9.0
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect

And sys.path looks good. Is there anything specific to IDEs that I may be missing on this?


Based on your cross-post I would also assume that your PyCharm is using another env with a different PyTorch installation.
I would thus either create a new virtual env, reinstall PyTorch there, and point PyCharm to it, or make sure to uninstall all PyTorch installations in the current and base environments and reinstall it in the current env only.
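To see which interpreter and PyTorch installation the IDE actually picks up, you could run this from its debug console (paths are illustrative):

import sys, torch
print(sys.executable)  # interpreter in use, e.g. /home/user/envs/myenv/bin/python3
print(torch.__file__)  # should point into the same env's site-packages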

Thank you.

I discovered my virtual environment had problems. When I tried to install packages into it, they would be installed globally, not locally. There was corruption throughout… I deleted it and started from scratch.

I came up with the following steps as a guide for anyone who would like to have a type of cheatsheet to verify their installation:

Nvidia Driver and CUDA Toolkit

  1. If already installed, examine your Nvidia GPU driver version

nvidia-smi

or

cat /proc/driver/nvidia/version

  2. Learn its architecture

sudo lshw -C display

  3. Learn your current Linux kernel

uname -a

  4. Look up the Nvidia Compatibility Matrix to determine the correct driver, toolkit, and libcudnn

Support Matrix - NVIDIA Docs

Support Matrix - NVIDIA Docs (gcc, glibc)

  5. Install Driver

sudo apt install nvidia-driver-XXX

  6. Install CUDA Toolkit

https://developer.nvidia.com/cuda-downloads

  7. Install libcudnnX (useful for deep learning with CUDA)

Installation Guide - NVIDIA Docs

sudo apt install libcudnnX

  8. Install pytorch
  • we will wait for this until you set up your virtualenv below.

Testing your system’s python setup

  1. First, note the location of the system-wide python interpreter

which python3

  2. Note the location of the system-wide pip

which pip3

  3. See what packages are installed globally (this command will also list packages that were installed via apt-get install)

python3 -m pip list (or alternatively python3 -m pip freeze)

  4. Create the virtualenv if not yet created (and activate it, as shown below)

python3 -m venv name_for_your_env
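Then activate it (Linux/macOS):

source name_for_your_env/bin/activate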

  5. Usually, you will be asked to install the required packages, normally listed in a "requirements.txt" file. Examine it and become familiar with it. From within your virtual environment, install them via:

python3 -m pip install -r requirements.txt

  6. If not already installed, install pytorch.

You can get the pip3/conda command from here. Most people recommend conda/docker installs; we are using pip3 to have more flexibility with the packages we need across different repos.

  7. Note that if a package is properly installed, it should appear in your virtual_env/lib/pythonX.X/site-packages folder.

  8. Additionally, ensure your PYTHONPATH is properly set (learn more about PYTHONPATH/imports/sys.path here: The Definitive Guide to Python import Statements | Chris Yeh)

  • PYTHONPATH is an environment variable that contains paths from which Python loads modules/scripts that are not binaries (i.e., located outside site-packages).

  • The pythonpath env variable is set in the .bashrc file found in your user folder (the user folder is located at ~/ and the “.” means it is a hidden file). Use your favorite editor to open it:

emacs ~/.bashrc

  • Check whether you have already set any PYTHONPATH entries:

export PYTHONPATH=$PYTHONPATH:/new/path1/goes/here:/new/path2/goes/here

Sanity Checks for torch/gpu

  1. In your virtualenv, open a python interpreter:

python3 (or even better ipython3 – you will need to install it first via pip3 install ipython).

  2. Check the system path from which modules are loaded

import sys

sys.path (should not see undesired paths here).

  3. Import torch

import torch

  4. Double-check that this torch module is located inside your virtual environment

import imp

imp.find_module('torch') → should return a path in your virtualenv

  5. Check the version of your torch module and CUDA

torch.__version__

torch.version.cuda

  6. Check the supported architectures

torch.cuda.get_arch_list()

  7. Check the number of GPUs detected

torch.cuda.device_count()

  8. Can you read the device?

device = torch.device('cuda:0')  # 0 by default; if you have more GPUs, increase the index.
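As a convenience, the torch checks above can be combined into one small script (the device index is illustrative):

import torch
print(torch.__version__)           # PyTorch build, e.g. '1.9.0+cu111'
print(torch.version.cuda)          # CUDA runtime shipped with the binaries
print(torch.cuda.get_arch_list())  # compute capabilities supported by this build
print(torch.cuda.is_available())   # True if a usable GPU is detected
print(torch.cuda.device_count())   # number of visible GPUs
device = torch.device('cuda:0')    # 0 by default; increase the index for more GPUs
print(torch.cuda.get_device_name(device))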


Hi,

Having the exact same issue.

NVIDIA GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.

How would you check the CUDA version within the pip wheel? Currently I have torch 1.9.1 and torchvision 0.10.1 in the virtual environment. When running nvidia-smi I get a CUDA version of 11.4. Any pointers?

torch.cuda.get_arch_list() will return the compute capabilities used in your current PyTorch build and torch.version.cuda will return the used CUDA runtime.


10.2 is the CUDA runtime currently being used. How would I fix this? Is there a way to specify the CUDA version used in the pip wheel?

Yes, you can select the CUDA version in this UI.
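For your torch 1.9.1 / torchvision 0.10.1 setup that should be something along the lines of (double-check against the selector):

pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html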

And it doesn’t matter that my root contains cuda 11.4 because I’ll be downloading this version into the venv?

It depends a bit on your use case. The pip wheels and conda binaries ship with their own CUDA runtime and you will be able to run PyTorch code without using the local CUDA toolkit (as long as the right CUDA runtime is selected; e.g. for Ampere GPUs you have to use CUDA>=11).
However, if you want to build a custom CUDA extension, the local CUDA toolkit will be used and you should install a matching version to the runtime used in PyTorch.
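A quick way to compare the two (run both in the environment you would build the extension in):

python3 -c "import torch; print(torch.version.cuda)"  # runtime used by PyTorch
nvcc --version  # local CUDA toolkit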

Hi @Ilias_Giannakopoulos, I am facing the same issue with my RTX 3070. I haven’t installed torch globally; I installed it only inside a conda environment, and I’m using Ubuntu as the OS. How can I add TORCH_CUDA_ARCH_LIST? Is it possible to set it while installing with conda?

@Ilias_Giannakopoulos I tried
TORCH_CUDA_ARCH_LIST="8.6" conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch

but it didn’t work.

No, as described here:

TORCH_CUDA_ARCH_LIST is an env var used for a source build and won’t change anything in the binaries.
You are currently selecting cudatoolkit=10.2, so use 11.1 for your Ampere GPU instead.
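That would be something along the lines of (taken from the previous-versions instructions; double-check there):

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge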