Building wheels for C++/CUDA extensions


I have a C++/CUDA extension, and to simplify deployment I would like to build wheels for it.

However, there seems to be a combinatorial explosion in the number of wheels necessary to build.

In particular, if I’m not mistaken, I need to provide one wheel for every combination of (supported)

  • minor Python versions
  • minor torch versions
  • CUDA toolkit versions
  • operating systems

This seems a bit much. Is there a better way of dealing with this? Note that the C++ part of the extension essentially just launches the CUDA kernels and is certainly not performance-critical.
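
To put a rough number on it, here is a small sketch that enumerates such a matrix. The version lists are hypothetical placeholders chosen for illustration, not an actual support policy.

```python
from itertools import product

# Hypothetical support matrix; the concrete versions are placeholders only.
python_versions = ["3.9", "3.10", "3.11", "3.12"]
torch_versions = ["2.1", "2.2", "2.3"]
cuda_versions = ["11.8", "12.1"]
operating_systems = ["linux", "windows"]

# One wheel per combination of the four axes.
wheel_configs = list(
    product(python_versions, torch_versions, cuda_versions, operating_systems)
)

print(len(wheel_configs))  # 4 * 3 * 2 * 2 = 48
```

Even this modest matrix already means 48 separate builds, and every additional version on any axis multiplies the total.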

In theory, I would assume that providing a single compiled PTX file should be enough, and torch could then contain the minimal code necessary for dispatching kernels with the provided dims etc. But I don’t think any such functionality exists. How do people normally deal with this?

I don’t think that providing the PTX would be enough, as the user would still need to compile the code and would thus need a local CUDA toolkit installation, which doesn’t seem to be a huge benefit compared to building your extension.
The advantage of building wheels is that users don’t need a local CUDA toolkit, and I think you are right that the support matrix is not small if you want to support different architectures etc.


Thanks for the answer!

Hmm, cuModuleLoad seemingly accepts PTX and is part of the driver API. As long as torch provided cuModuleLoad plus the functionality for launching the kernels, I think a single PTX file would be all that is needed, independent of the user’s configuration. I might well be missing something, though.
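
To make the idea concrete, here is a rough sketch of loading a PTX file through the driver API from Python via ctypes, without any torch involvement. It is only an assumption about how this could look, not something torch provides: the helper names `load_cuda_driver`, `check` and `load_ptx_kernel` are made up for illustration, as are the file name `kernels.ptx` and the kernel name `my_kernel` in the usage comment. Launching the kernel (cuLaunchKernel) and cleanup are left out.

```python
import ctypes


def load_cuda_driver():
    """Return a ctypes handle to the CUDA driver library, or None if it is not installed."""
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None


def check(status, what):
    """Raise if a driver API call did not return CUDA_SUCCESS (0)."""
    if status != 0:
        raise RuntimeError(f"{what} failed with CUresult {status}")


def load_ptx_kernel(driver, ptx_path: bytes, kernel_name: bytes):
    """Load a PTX file with cuModuleLoad and return a handle to one of its kernels."""
    check(driver.cuInit(0), "cuInit")

    device = ctypes.c_int()
    check(driver.cuDeviceGet(ctypes.byref(device), 0), "cuDeviceGet")

    # Use the primary context of device 0 and make it current for this thread.
    context = ctypes.c_void_p()
    check(driver.cuDevicePrimaryCtxRetain(ctypes.byref(context), device),
          "cuDevicePrimaryCtxRetain")
    check(driver.cuCtxSetCurrent(context), "cuCtxSetCurrent")

    # The driver compiles the PTX for the installed GPU at load time.
    module = ctypes.c_void_p()
    check(driver.cuModuleLoad(ctypes.byref(module), ptx_path), "cuModuleLoad")

    function = ctypes.c_void_p()
    check(driver.cuModuleGetFunction(ctypes.byref(function), module, kernel_name),
          "cuModuleGetFunction")
    return function


# Hypothetical usage (file and kernel names are placeholders):
#   driver = load_cuda_driver()
#   if driver is not None:
#       kernel = load_ptx_kernel(driver, b"kernels.ptx", b"my_kernel")
```

Only the NVIDIA driver library is needed at runtime here, which is the point of the question: the remaining work would be marshalling arguments and grid/block dims into a launch call.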

Thanks though, my main question was whether something already exists that simplifies this process. Good to know that I’m not missing something.