Long build time for custom C++/CUDA extension

ymtanga · September 4, 2025, 7:20am

I followed the official tutorials to write a custom C++/CUDA extension for PyTorch.

The newest tutorial suggests using pip install --no-build-isolation to build and install the extension module. However, during development, I frequently update my CUDA code. I’ve found that pip install recompiles everything from scratch each time, even when most of the code hasn’t changed. It takes me over 6 hours to recompile dozens of custom operators after just a small modification…

I also tried running python setup.py install directly to build my extension, as described in the deprecated C++/CUDA extension tutorial. This method keeps the build folder, which stores the object files so they can be reused in subsequent compilations. So, I’m currently using this approach for development. However, I know that running python setup.py directly is a deprecated way to build a Python module, so I’m still looking for a better solution.

I’m hoping to find ways to speed up both the initial and subsequent compilation times for my project. I’ve seen some Stack Overflow discussions suggesting ccache for this, but unfortunately, I don’t have the sudo privileges needed to install it on the server right now.
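
For reference, here is a minimal sketch of two environment variables that torch.utils.cpp_extension reads and that often affect build time: MAX_JOBS (parallel compile jobs when the build goes through ninja) and TORCH_CUDA_ARCH_LIST (which GPU architectures nvcc targets). The helper names build_env and pip_install_command and the values 8 and "8.0" are only illustrative assumptions. Whether they help depends on the project: they have no effect if setup.py hard-codes its own -gencode flags or if ninja is not installed.

```python
import os
import sys


def build_env(max_jobs, cuda_arch_list, base=None):
    """Return a copy of the environment with build-tuning variables set.

    build_env is a hypothetical helper, not part of PyTorch.
    """
    env = dict(os.environ if base is None else base)
    # MAX_JOBS caps the number of parallel compile jobs in the ninja build.
    env["MAX_JOBS"] = str(max_jobs)
    # TORCH_CUDA_ARCH_LIST limits the GPU architectures that get compiled,
    # e.g. "8.0" for a single architecture instead of several.
    env["TORCH_CUDA_ARCH_LIST"] = cuda_arch_list
    return env


def pip_install_command(project_dir="."):
    """Build the pip command from the tutorial as an argument list."""
    return [sys.executable, "-m", "pip", "install",
            "--no-build-isolation", "-v", project_dir]


# Illustrative values only; start from an empty base to show what is added.
env = build_env(max_jobs=8, cuda_arch_list="8.0", base={})
print(env)
```

The two pieces can then be combined with subprocess.run(pip_install_command(), env=build_env(8, "8.0"), check=True). This does not by itself make pip reuse object files between runs; it only reduces the work done per build.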

I use this image to build my torch extension with PyTorch 2.6.

Any suggestion would be very helpful. Thank you for reading my question!

lakshayg · January 7, 2026, 11:49pm

If you are using the container to build, why do you need sudo privileges on the server? You can just install ccache in the container and keep the cache in a directory shared with the host.
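
A rough sketch of that setup, run inside the container. It assumes ccache is already installed in the image and that a host directory is bind-mounted into the container at the hypothetical path /workspace/.ccache, so the cache survives container restarts. CCACHE_DIR is the standard ccache variable for the cache location. PYTORCH_NVCC is, as far as I know, read by torch.utils.cpp_extension in its ninja build path to override the nvcc command; please verify that against your installed PyTorch version. The helper name ccache_env is made up for this example.

```python
import os


def ccache_env(cache_dir, nvcc="nvcc", base=None):
    """Return a copy of the environment that routes nvcc through ccache.

    ccache_env is a hypothetical helper, not part of PyTorch or ccache.
    """
    env = dict(os.environ if base is None else base)
    # CCACHE_DIR tells ccache where to keep cached compilation results.
    # Point it at the directory shared with the host.
    env["CCACHE_DIR"] = str(cache_dir)
    # Assumption: PyTorch's extension builder uses PYTORCH_NVCC as the nvcc
    # command when it is set; check this for your PyTorch version.
    env["PYTORCH_NVCC"] = f"ccache {nvcc}"
    return env


# /workspace/.ccache is an assumed mount point inside the container.
env = ccache_env("/workspace/.ccache", base={})
print(env)
```

Pass the resulting environment to whatever command builds the extension, for example subprocess.run([...], env=ccache_env("/workspace/.ccache")). After a second build, ccache -s should show cache hits if the setup is working.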