cudaCheckError() failed: PyTorch 0.4 and CUDA11

Arda_Tumay · October 18, 2021, 12:35pm

Hello, I really need guidance about the situation I faced.

Let me give details about my working environment:

Working in Google Colab
Pytorch version is 0.4.1.post2
GPU: Tesla k80, Driver Version: 460.32.03, CUDA Version: 11.2, Compute Cap: 3.7

I have a network written in PyTorch 0.4 which is an old version. The network uses Fast RCNN as a backbone and there are some codes written in Cuda as an extension to the python script for making non-max suppression and ROI align faster. Because of these codes, I cannot easily switch to the latest version of PyTorch. I tried and I got the error of “ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.” So I installed PyTorch 0.4 and training starts without any error until it comes to codes that PyTorch utilizes the GPU. When the code tries to utilize GPU it hits “cudaCheckError() failed : no kernel image is available for execution on the device” error. I really looked at similar questions and I inferred that PyTorch 0.4 is a very old version and I need to build it from source to run with CUDA 11. Moreover, Cuda codes are previously compiled with these options, “-x cu -Xcompiler -fPIC -arch=sm_60”.

At this point, I confused. I am not sure sm_60 is a right architecture for Tesla K80. Can this be the cause of the problem? Will it be enough to only compile cuda codes with proper options? Or if I need to build pytorch 0.4 from sources, how do I build it with CUDA 11 support? And is it possible to do that in Google Colab?

I really need someone to make my mind clear about this problem because I dig the web too much and all answers I found said different things.

Thank you very much in advance for any help!

ptrblck · October 18, 2021, 7:47pm

The K80 wouldn’t need CUDA11, so you should be able to build the extension with the matching CUDA version which is used for your PyTorch build.

No, the K80 has the compute capability 3.7, so building for sm_60 won’t work on this device.

This won’t be easily possible, since the CUDA11 support landed in PyTorch ~1.7. In any case you wouldn’t need CUDA11 as described before.

Arda_Tumay · October 19, 2021, 7:54am

First of all, thank you very much for your answer.

Based on your answer, the problem is most probably because of the wrong settings that the Cuda codes are compiled. The command to build Cuda codes was

nvcc -c -o nms_kernel.cu.o nms_kernel.cu -x cu -Xcompiler -fPIC -arch=**sm_60**

and I will change it to

nvcc -c -o nms_kernel.cu.o nms_kernel.cu -x cu -Xcompiler -fPIC -arch=**sm_37**

I hope it will work this time with PyTorch 0.4.1.post2 in Google Colab.

Moreover, there should be some misleading information in Google Colab about Tesla K80. Below picture was the output of the command of !nvidia-smi

Finally, I want to ask two more things about the installation from source process. If I want to build PyTorch from the source, how to compile it with a specific CUDA version? Or is it automatically built with the CUDA installed in the machine?

And is there any compatibility table that shows PyTorch versions and their compatible CUDA versions? I am asking this because I have RTX 2080 Super card in my local machine and I may have to run PyTorch 0.4 on my local machine in near future. RTX2080 super has compute capability of 7.5.

Thank you again your kind answer.

ptrblck · October 19, 2021, 8:28am

The shown CUDA version would either refer to the locally installed CUDA toolkit (if available) or the CUDA toolkit release version coming with the driver, so not necessarily the one used in PyTorch (the PyTorch binaries ship with their own CUDA runtime).

You can specify the location to the locally installed CUDA toolkit via:

CUDA_HOME=/usr/local/cuda python setup.py install

If you don’t set CUDA_HOME then the default location(s) will be searched.

Not that I’m aware of, as the general advice is to update to the latest release, which is compatible with all CUDA versions >=10.2, which also includes your RTX2080 Super.
I would thus highly recommend not to use PyTorch 0.4 as it was released in ~April 2018.

Arda_Tumay · October 25, 2021, 1:56pm

First of all, thank you for your answers, I made it to run my code with pytorch 0.4 with cuda 9 in Google Colab. But there was a bug in my code in pytorch 0.4 and I downgraded it to pytorch 0.3 and it worked.

However, in my local setup, since RTX2080s required Cuda 10 minimum, I couldnt find any binary of pytorch 0.3 compiled with cuda 10 for linux in this link https://download.pytorch.org/whl/cu100/torch_stable.html.

So, most probably, there are two options for me now.

I will try to compile PyTorch 0.3 from the source with Cuda 10
I will try to rewrite nms and roi align extensions written in c and cuda to make the code work with the last version of PyTorch(1.9).

I am not sure which one will be difficult to do. Because without rewriting, I hit this error ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead with last version of pytorch.

Mayank · March 13, 2022, 7:17pm

Hi @Arda_Tumay

I am also facing similar issue as like yours.

I would like to ask if you solved your error.
I am also working on GitHub - skyhehe123/VoxelNet-pytorch and it uses torch 0.3.1, which I can not use it as my cuda is 11.3 version so i was using torch 1.10. So I am not sure what to do now.

If you can share your solution experience, I can really use the help.

Thank you.

Arda_Tumay · April 18, 2022, 5:13pm

Hı @Mayank

I am sad to say that I did not find any solution for this problem. There are two ways you can follow I think, the first way is finding another repo for VoxelNet whose torch version is relatively new and can work with new CUDA versions and the second solution is upgrading torch version of the repo you found and fix the problems one by one that is caused by new torch version.

If you try hard, you can run the repo you found whose torch version is 0.3 on a GPU with CUDA 9. You also need to set the gcc and g++ version to 6, if the code have extensions written in C++. If so, those extensions should also be compiled with correct torch version, correct gcc version and correct CUDA version which is 9 for torch 0.3 and 0.4. But nevertheless there will be problems… So I suggest you find a new code with new version of torch.

I hope I could have helped you with my answer. Write here if you face further problems.

Mayank · April 20, 2022, 9:25am

Hi @Arda_Tumay

Thank you for your reply and suggestion. You said very rightly.
Running the older version is giving me different issues which is taking a lot of my time.
Therefore I have started working on other repo now