C++/cuda custom function: RuntimeError: CUDA error: invalid device function

hi,
i am using a c++/cuda custom extension.
when running the extension, i get this error:

RuntimeError: CUDA error: invalid device function

at the very first call of a function in the extension.

could the nvcc version and cuda versions be the cause since pytorch is shipped with its cuda?

the packages are maintained in a conda env.
the custom function was installed with:

python setup.py build
python setup.py install

within the virtual env.
this is the setup.py file:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='extensions',
    ext_modules=[
        CUDAExtension('HT_opp', [
            'cuda_opps/HT.cpp',
            'cuda_opps/HT_kernel.cu',
        ]),
    ],
    cmdclass={
        'build_ext': BuildExtension
    })

i am using a pytorch module that is using the extension.
the error is raised in this module when calling a function in the extension.

python 3.7.9
pytorch 1.9.0 installed with conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

CUDA Version of nvidia-smi: 11.1

still reading this

thanks

CUDA error: invalid device function could point to an architecture mismatch, i.e. I guess you are using a CUDA function potentially specified for another compute capability or are building for the wrong architecture. Since the issue is raised in your custom CUDA code, it’s a bit hard to be more specific.

thank you!
is there a way to tell for what compute capability a code was designed?
here is the c++/cuda code.
it was developed almost 2 years ago.
the file pl.py is the pytorch module that uses the extension HT_opp. the code of the extension is in cuda_opps.
the authors did not specify any requirements or specificity about the gpu.
i opened an issue with them about this ~24h ago. no news yet.

thanks again

The custom CUDA extension build steps should either use TORCH_CUDA_ARCH_LIST or should check your local GPUs to select the compute capability. You could try to set TORCH_CUDA_ARCH_LIST manually and if this doesn’t work try to narrow down the offending operation, which creates the issue.
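
A minimal sketch of how a value for TORCH_CUDA_ARCH_LIST could be derived from the GPUs that PyTorch can see. It assumes a CUDA-enabled PyTorch install; `arch_entry` and `visible_arch_list` are hypothetical helper names, not part of PyTorch.

```python
def arch_entry(major, minor, ptx=False):
    """Format a compute capability as a TORCH_CUDA_ARCH_LIST entry, e.g. (6, 0) -> '6.0'."""
    entry = f"{major}.{minor}"
    return entry + "+PTX" if ptx else entry


def visible_arch_list():
    """Return a ';'-joined arch list for the GPUs visible to PyTorch ('' if none)."""
    try:
        import torch  # imported lazily so arch_entry works without PyTorch
    except ImportError:
        return ""
    if not torch.cuda.is_available():
        return ""
    # One (major, minor) tuple per visible device; the set removes duplicates.
    caps = {torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())}
    return ";".join(arch_entry(major, minor) for major, minor in sorted(caps))


if __name__ == "__main__":
    print("TORCH_CUDA_ARCH_LIST candidate:",
          visible_arch_list() or "(no GPU visible)")
```

The printed string could then be exported as TORCH_CUDA_ARCH_LIST before running the build.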

thanks.
my gpu is a tesla p100, which is supposed to have a compute capability of 6.0.

i tried with TORCH_CUDA_ARCH_LIST="6.0+PTX" python setup.py build, but nothing changed. i got the same error.

the doc says that by default, torch.utils.cpp_extension.CUDAExtension will build for all the visible archs.
so, it is supposed to build for the right cc. there are 2 gpus with same arch.

found similar issue with a code from another repo.

the error message is so vague. and it does not point to any file.
there is no makefile to change the cc as suggested here.

i tried this on 2 servers with 2 different hardware but similar virtual env.

i dont know how to get what is offending cuda.
either the code of cuda extension does not go well with the current version cuda/pytorch.
or something is wrong with the installation (conda)…

does current cuda version still support old cuda code?
not an expert in cuda, but when writing cuda extensions, does one need to know the gpu arch beforehand? if that is the case, it will make the code almost useless…

when the build of the cuda extension is successful, is there any reason that there will be issues at runtime, provided that the inputs are correct? since the extension is supposed to be built for all archs, i assume at runtime it will choose the right arch for the current gpu.

i ll try to check their cuda code. when edited, it has to be rebuilt to check the impact.
i still lean toward that this may be caused by nvcc, pytorch, cuda, cc, conda incompatible versions that do not go well with the cuda extension code… but again, i am not an expert.

i thought extensions are easy to use. once the cuda/c++ part is written, which is the hard part, the rest is plug and play as presented in the tutorial. you build and call. what could go wrong? the build is successful, the installation is successful.

i learned somewhere that when importing the extension, one needs to import torch first to prepare some stuff (links to libs …)

so, it seems cuda extension code and pytorch cuda code maybe need to be built with the same versions or compatible versions at least. what if it is not the case? i installed the conda pre-built version, and i compiled this cuda extension on my machine. is there a chance that this difference in compilation could cause any issue?

i cant even tell what is wrong from that message error.
is there a way to make cuda tell a more helpful error? there should be a way to debug extensions, right?
does pytorch have some debugging tools for extensions to tell what’s wrong?
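
One way to see which GPU architectures a built extension actually contains is to list its embedded device code with `cuobjdump --list-elf` from the CUDA toolkit. The sketch below is only an illustration: it assumes `cuobjdump` is on the PATH, the helper names are hypothetical, and the sample line is an assumed output format rather than a captured log.

```python
import re
import subprocess


def sm_targets(listing):
    """Extract the distinct sm_XX architectures named in a cuobjdump listing."""
    return sorted(set(re.findall(r"sm_(\d+)", listing)), key=int)


def embedded_archs(library_path):
    """Run `cuobjdump --list-elf` on a built extension and return its sm_XX targets."""
    result = subprocess.run(
        ["cuobjdump", "--list-elf", library_path],
        capture_output=True, text=True, check=True,
    )
    return sm_targets(result.stdout)


if __name__ == "__main__":
    # Illustrative line only; the exact cuobjdump output format is an assumption.
    sample = "ELF file    1: HT_opp.1.sm_60.cubin"
    print(sm_targets(sample))
```

If the capability of the GPU in use (60 for a P100) does not show up for the built `.so`, an architecture mismatch is one likely suspect. Separately, setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python stack trace points at the call that actually failed.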

i really appreciate your help.
been stuck for 6 days at this stage. build and test one of these repos. it is a minor step. it could be done in 10 mins. 6 days later, and here i am unable to run a simple test provided by the authors.
authors didnt reply yet. i also sent an email. some seem in vacation.
will let you know.

thank you again

This could mean that the built binary itself should be correct.

Yes, assuming you are using the latest PyTorch release.

Your setup seems to be the issue, since you are mixing the CUDA runtime used in the PyTorch binaries (11.1) with the local CUDA toolkit used to build the extension (10.0), so you would need to stick to the same version. I’ve rebuilt the extension on a server with a P100 using matching CUDA versions and this code snippet works fine:

import torch  # needed for torch.randn / torch.from_numpy below

from PAM_cuda.pl import PermutohedralLattice

if __name__ == '__main__':
    import numpy as np
    pl = PermutohedralLattice.apply

    im = torch.randn(24, 24, 3)
    indices = np.reshape(np.indices(im.shape[:2]), (2, -1))[None, :]
    im = im.permute(2, 0, 1)
    rgb = im.reshape(3, -1).unsqueeze(0)
    out = pl(torch.from_numpy(indices / 5.0).cuda().float(),
             (rgb / 0.125).cuda().float())

    output = out.squeeze().cpu().numpy()
    output = np.transpose(output, (1, 0))
    output = np.reshape(output, (im.shape[1], im.shape[2], 3))
    print(output)

thank you very much.
indeed, the conflict between the local cuda and the one used to build pytorch was the cause of the issue.
after fixing the nvcc path to point to the right cuda, it worked.

i relied on nvidia-smi to get the right cuda version, but i shouldnt have.
nvcc -V is the right tool.
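
A small sketch of that check: compare the release reported by the local nvcc with the CUDA version PyTorch was built against. The helper names are hypothetical; in a real environment the two inputs would be `torch.version.cuda` (which is None for CPU-only builds) and the captured output of `nvcc --version`.

```python
import re


def nvcc_release(nvcc_output):
    """Return (major, minor) from the text printed by `nvcc --version`, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def same_cuda(torch_cuda, nvcc_output):
    """Compare torch.version.cuda (e.g. '11.1') with the local nvcc release."""
    major, minor = torch_cuda.split(".")[:2]
    return nvcc_release(nvcc_output) == (int(major), int(minor))


if __name__ == "__main__":
    # Last line of the nvcc output quoted earlier in this thread.
    local = "Cuda compilation tools, release 10.0, V10.0.130"
    print(nvcc_release(local))       # (10, 0)
    print(same_cuda("11.1", local))  # False
```

nvidia-smi is not a substitute here: the version it prints comes from the driver, not from the toolkit that nvcc belongs to.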

one question: can we compile the extension without gpu available, but it will be available later?
in some clusters, we have access to the frontal where there is no gpus. it is upon request that we are allocated gpus.
i read in the doc, that there are 2 ways to build extensions: ahead and just in time.
so, in my case, i should probably use the second option.
but i was wondering if the first option is viable? so i do the install only once.

i think my answer is in the doc that we cant do the first option in this case, right?

By default the extension will be compiled to run on all archs of the cards visible during the building 
process of the extension, plus PTX. 
If down the road a new card is installed the extension may need to be recompiled. 

they said may need to be recompiled and not have… so, not sure.

in case i use jit method and there are multiple gpus. 2 cases are presented:

  1. my code will use only one (more likely to not know which one the moment when loading the extension)
  2. my code will use multiple gpus with ddp (what if gpus have different arch??).

i assume that jit will handle both cases automatically without any additional config. right?

about the runtime when using jit. from the doc:

lltm_cpp = load(name="lltm_cpp", sources=["lltm.cpp"])

The first time you run through this line, it will take some time, as the extension is compiling 
in the background. Since we use the Ninja build system to build your sources, re-compilation is
 incremental and thus re-loading the extension when you run your Python module a second 
time is fast and has low overhead if you didn’t change the extension’s source files.

i assume the expensive time they are talking about is the loading when the compilation happens.
once loaded, the runtime of calling is the same as the ahead-method.
i mention this because in my case, every time i run my code, it is allocated a new different gpu. so, the compilation directory will be dependent on the job and it will be temporary (i.e. it will be deleted after the job is done). when using ddp, i can set it up so that only the main process will load/compile the extension. once done, the other processes will have access to it.
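
The plan above could be sketched roughly as follows. This is not a tested recipe: it assumes the process group is already initialized, that all ranks see the same filesystem, and that the job id comes from the scheduler (for example an environment variable); `job_build_dir` and `load_ht_opp` are hypothetical names. The source paths are the ones from the setup.py earlier in this thread.

```python
import os


def job_build_dir(base_dir, job_id):
    """Per-job build directory, so each allocation compiles for its own GPU."""
    return os.path.join(base_dir, f"ht_opp_job_{job_id}")


def load_ht_opp(rank, build_dir):
    """Compile on rank 0 first, then let the other ranks load the finished build."""
    import torch.distributed as dist  # assumes init_process_group was called
    from torch.utils.cpp_extension import load

    os.makedirs(build_dir, exist_ok=True)  # the build directory must exist
    kwargs = dict(
        name="HT_opp",
        sources=["cuda_opps/HT.cpp", "cuda_opps/HT_kernel.cu"],
        build_directory=build_dir,
    )
    ext = load(**kwargs) if rank == 0 else None
    dist.barrier()  # other ranks wait here until rank 0 has finished compiling
    if ext is None:
        ext = load(**kwargs)  # should find the existing build and load quickly
    return ext
```

Once loaded, calls into the extension run the same compiled code as an ahead-of-time build; only the first `load` pays the compilation cost.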
thanks

this experience brings me to this natural question related to compiling extensions and this versions issue:

i read several times that when installing pytorch using conda install pytorch==1.9.0 cudatoolkit=11.1 -c pytorch -c nvidia, that cuda toolkit will be shipped with the installation.

i read other answers about this including yours and this one.

because the nvcc is part of the cuda toolkit and not the driver, i assume that nvcc is shipped with the installation as well. is that right?

by the way, i did a search in the virtual env of conda for cuda toolkit, nvcc, and i couldnt find any. probably they are hidden in a lib or something. do you know where they are installed?

i think you know now where i am going with this.
if we have access to the cuda toolkit that was used to build the installed binary of pytorch, and since that binary wont use the local install of the cuda runtime anyway, could we use the shipped cuda toolkit to compile new extensions, allowing us to be independent from the local cuda runtime installation (that could be not up to date, messy, …)?
this could be a huge benefit because we are sure that the extension is compiled with the same exact cuda version that was used to build pytorch, right?

you said yourself in the threads mentioned above, that when compiling extensions with pytorch that was installed as above (i.e. with the shipped cuda toolkit), one needs to install locally the same version as the one used to build pytorch. if we have access to the shipped nvcc and cuda stuff could we skip this install step? or are there other things necessary for the compilation that were not shipped? one of the comments mentioned that the cuda lib is too huge to be shipped with pytorch. that comment was in 2019, but in the same thread, in 2020, you said that the cuda toolkit is shipped. i am not sure if there is a difference between cuda toolkit and cuda library.

again, thank you very much for solving this issue. it was a huge help.
i really appreciate it!

i apologize for the long comment

Yes, you can cross-compile for different architectures and create a pip wheel, but given the issues you were seeing in the past the JIT compilation might be the easier path for now.
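
A rough sketch of what an ahead-of-time build on a node without GPUs could look like: the target architectures are passed explicitly through TORCH_CUDA_ARCH_LIST instead of being detected. It assumes a matching nvcc is installed on the build node; the helper names and the example architectures are placeholders.

```python
import os
import subprocess
import sys


def build_env(archs, base_env=None):
    """Copy the environment and set TORCH_CUDA_ARCH_LIST to the given archs."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = ";".join(archs)
    return env


def build_extension(archs, setup_dir="."):
    """Run `python setup.py install` for the listed archs; no GPU has to be visible."""
    subprocess.run(
        [sys.executable, "setup.py", "install"],
        cwd=setup_dir, env=build_env(archs), check=True,
    )


if __name__ == "__main__":
    # Example targets: P100 (6.0) plus 7.0 with PTX; adjust to the cluster's GPUs.
    print(build_env(["6.0", "7.0+PTX"], base_env={})["TORCH_CUDA_ARCH_LIST"])
```

Calling `build_extension(["6.0", "7.0+PTX"])` from the directory containing setup.py would then build once for every listed architecture.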

The “expensive time” is used during the initial compilation and the next call will use the already built and cached lib, so will be much faster. E.g. using this example I get an initial startup time of:

Loading extension module lltm...

real	2m13.153s
user	2m14.363s
sys	0m16.594s

and further calls need:

Loading extension module lltm...

real	0m1.984s
user	0m2.198s
sys	0m6.165s

No, that’s not correct, as only the runtime is shipped, not the entire CUDA toolkit.

No, it’s not part of the conda cudatoolkit binaries.
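
Since nvcc has to come from a local toolkit install, it can help to check which toolkit a build would pick up. PyTorch exposes the location it resolved as `torch.utils.cpp_extension.CUDA_HOME`. The function below is only an approximation of that lookup (environment variables first, then the nvcc found on PATH), written as an assumption about the order rather than a copy of PyTorch's logic; `find_cuda_home` is a hypothetical name.

```python
import os
import shutil


def find_cuda_home(env=None):
    """Guess the toolkit root: CUDA_HOME/CUDA_PATH first, then the nvcc on PATH."""
    env = os.environ if env is None else env
    home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if home:
        return home
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    if nvcc:
        # nvcc lives in <root>/bin/nvcc, so the root is two levels up.
        return os.path.dirname(os.path.dirname(nvcc))
    return None


if __name__ == "__main__":
    print("toolkit root guess:", find_cuda_home())
```

If the result points at a toolkit whose version differs from the one PyTorch was built with, setting CUDA_HOME (or fixing PATH) before building is one way to switch.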

ok, that clarifies things.
thank you very much for your help!