C++/CUDA custom function: RuntimeError: CUDA error: invalid device function

Thank you very much.
Indeed, the conflict between the local CUDA and the one used to build PyTorch was the cause of the issue.
After pointing the nvcc path to the right CUDA, it worked.

I relied on nvidia-smi to get the CUDA version, but I shouldn't have: nvidia-smi reports the driver's CUDA version, not the toolkit's.
nvcc -V is the right tool.
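For reference, here is the small sanity check I now run to compare the version PyTorch was built with against the local toolkit (torch.version.cuda and CUDA_HOME are real attributes of the installation; the script itself is just my own check, not an official tool):

    # compare the CUDA version PyTorch was built with against the local nvcc
    import subprocess

    import torch
    from torch.utils.cpp_extension import CUDA_HOME

    # the CUDA version the PyTorch binary was compiled with
    print("torch built with CUDA:", torch.version.cuda)

    # the toolkit torch.utils.cpp_extension will use to compile extensions
    print("CUDA_HOME:", CUDA_HOME)

    # what the local nvcc reports (nvidia-smi would show the driver version instead)
    print(subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout)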

One question: can we compile the extension on a machine without a GPU, so that it works on GPUs available later?
On some clusters, we only have access to the front-end node, where there are no GPUs; GPUs are allocated upon request.
I read in the doc that there are two ways to build extensions: ahead of time and just in time.
So, in my case, I should probably use the second option.
But I was wondering if the first option is viable, so that I do the install only once.

I think my answer is in the doc: we can't use the first option in this case, right?

By default the extension will be compiled to run on all archs of the cards visible during the building process of the extension, plus PTX. If down the road a new card is installed the extension may need to be recompiled.

They said it may need to be recompiled, not that it has to… so, I'm not sure.
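If I understand the docs correctly, the ahead-of-time build can be pointed at explicit architectures through the TORCH_CUDA_ARCH_LIST environment variable, even when no GPU is visible at build time. A minimal sketch of the setup.py I have in mind (the file names follow the lltm tutorial, and the arch list is just an example; it would have to match the cards the cluster can allocate):

    # setup.py -- minimal sketch of an ahead-of-time build without a visible GPU
    import os

    from setuptools import setup
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    # target these archs (plus PTX for the newest one) explicitly instead of
    # auto-detecting them from the GPUs visible during the build
    os.environ.setdefault("TORCH_CUDA_ARCH_LIST", "7.0;7.5;8.0+PTX")

    setup(
        name="lltm_cuda",
        ext_modules=[CUDAExtension("lltm_cuda", ["lltm_cuda.cpp", "lltm_cuda_kernel.cu"])],
        cmdclass={"build_ext": BuildExtension},
    )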

In case I use the JIT method and there are multiple GPUs, two cases arise:

  1. My code will use only one GPU (most likely I won't know which one at the moment the extension is loaded).
  2. My code will use multiple GPUs with DDP (what if the GPUs have different architectures?).

I assume the JIT will handle both cases automatically, without any additional config. Right?
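For case 2, I suppose one could at least enumerate the architectures present; torch.cuda.get_device_capability is the real API, the loop is just my illustration:

    import torch

    # print the compute capability of every visible GPU; on a heterogeneous
    # node the extension would need to cover all of these archs
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"cuda:{i} -> sm_{major}{minor}")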

About the runtime when using JIT, from the doc:

from torch.utils.cpp_extension import load

lltm_cpp = load(name="lltm_cpp", sources=["lltm.cpp"])

The first time you run through this line, it will take some time, as the extension is compiling in the background. Since we use the Ninja build system to build your sources, re-compilation is incremental and thus re-loading the extension when you run your Python module a second time is fast and has low overhead if you didn’t change the extension’s source files.

I assume the expensive part they are talking about is the load during which the compilation happens.
Once loaded, the runtime cost of calling the extension is the same as with the ahead-of-time method.
I mention this because, in my case, every run of my code is allocated a new, different GPU. So the compilation directory will depend on the job and will be temporary (i.e., deleted after the job is done). When using DDP, I can arrange for only the main process to load/compile the extension; once done, the other processes will have access to it, as in the sketch below.
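A sketch of what I have in mind, assuming the usual torch.distributed setup is already initialized (build_dir is a hypothetical job-local path; load and its build_directory argument are the real API):

    import torch.distributed as dist
    from torch.utils.cpp_extension import load

    def load_extension(build_dir):
        # build_dir: hypothetical job-local directory, deleted when the job ends
        if dist.get_rank() == 0:
            # only the main process compiles (the Ninja build happens here)
            ext = load(name="lltm_cpp", sources=["lltm.cpp"], build_directory=build_dir)
        dist.barrier()  # everyone waits until the build artifacts exist
        if dist.get_rank() != 0:
            # the other ranks just load the already-compiled artifacts
            ext = load(name="lltm_cpp", sources=["lltm.cpp"], build_directory=build_dir)
        return ext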
Thanks.

This experience brings me to a natural question related to compiling extensions and this version issue:

I read several times that when installing PyTorch using conda install pytorch==1.9.0 cudatoolkit=11.1 -c pytorch -c nvidia, the CUDA toolkit is shipped with the installation.

I read other answers about this, including yours and this one.

Because nvcc is part of the CUDA toolkit and not of the driver, I assume that nvcc is shipped with the installation as well. Is that right?

By the way, I searched the conda virtual env for the CUDA toolkit and nvcc, and I couldn't find either. Probably they are hidden in a lib or something. Do you know where they are installed?
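For what it's worth, this is roughly how I searched (just a glob over the env, nothing official):

    import sys
    from pathlib import Path

    # look for nvcc and the CUDA runtime library anywhere under the conda env
    env = Path(sys.prefix)
    for pattern in ("**/nvcc", "**/libcudart*"):
        for path in env.glob(pattern):
            print(path)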

I think you see now where I am going with this.
If we have access to the CUDA toolkit that was used to build the installed PyTorch binary, and since that binary won't use the local CUDA runtime install anyway, could we use the shipped CUDA toolkit to compile new extensions, making us independent of the local CUDA installation (which could be outdated, messy, …)?
This could be a huge benefit, because we would be sure the extension is compiled with the exact same CUDA version that was used to build PyTorch, right?

You said yourself in the threads mentioned above that when compiling extensions against a PyTorch installed as above (i.e., with the shipped CUDA toolkit), one needs to install locally the same CUDA version as the one used to build PyTorch. If we have access to the shipped nvcc and the rest of the CUDA bits, could we skip this install step? Or are there other things necessary for the compilation that were not shipped? One of the comments mentioned that the CUDA lib is too huge to be shipped with PyTorch; that comment was from 2019, because in the same thread, in 2020, you said that the CUDA toolkit is shipped. I am not sure if there is a difference between the CUDA toolkit and the CUDA library.

Again, thank you very much for solving this issue. It was a huge help.
I really appreciate it!

I apologize for the long comment.