Speeding up C++/CUDA extension build time

Hi! I just ported a bunch of CFFI extensions from PyTorch 0.4 to the “new” C++ extension framework needed in PyTorch 1.6.

Everything works fine, and it builds using ninja. But it rebuilds everything every time I run the build, and apart from that, the build process is now more than an order of magnitude slower than before (a build now takes about 2-3 minutes, versus roughly 10 seconds with CFFI :confused: ).

This might be OK for a final deploy but it’s of course horrid for development.

Any hints on either issue? Why does it “clean” the objects every time I run? And why are the individual files compiled so much more slowly now? (I guess this is due to the massive C++ headers that now have to be pulled in.) The second issue matters less if the first one can be solved.

FWIW, in the build/ folder, I still see all the .o files from the last build under build/temp.linux-x86_64-3.6/ but they aren’t used the next time I build.

My build script is just something like this:

setup(name='my_render',
      ext_modules=[CUDAExtension('my_render', ['my_render.cpp', 'my_render_gaussian_cuda.cu'])],
      cmdclass={'build_ext': BuildExtension})
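For reference, a complete, self-contained version of a script along those lines might look like this (the imports and surrounding boilerplate are my assumption, since the fragment above omits them):

```python
# Hypothetical minimal setup.py for a CUDA extension; the module and
# source file names are taken from the fragment above.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='my_render',
    ext_modules=[
        CUDAExtension(
            name='my_render',
            sources=['my_render.cpp', 'my_render_gaussian_cuda.cu'],
        ),
    ],
    # BuildExtension drives the compile; with ninja installed it should
    # skip recompiling sources that haven't changed.
    cmdclass={'build_ext': BuildExtension},
)
```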

and I invoke it with:

MAX_JOBS=6 TORCH_CUDA_ARCH_LIST=Pascal ./build_cudaextensions.py install --prefix=$HOME/.local

I know I can trim the arch list and just set 6.1, for example (Pascal implies both 6.0 and 6.1, so there’s some per-file speedup to be had there).

Sorry, I missed this one.
We made a lot of changes from 0.4 to 1.6. The answer to your second issue is what you suspected: there is now a centralized torch.h header in the framework, and pulling all of its symbols in takes quite a bit of time per file.

I have no clear answer to your first issue at the moment. I was working with libtorch 1.5 and cpp files, and it seemed to me that ninja doesn’t rebuild everything: a subsequent build is normally much faster than the first one. Ninja support was added not long ago by someone on the PyTorch team; I’ll dig a bit and get back to you.

Hi! Yes, installing ninja and setting MAX_JOBS=4 (along with passing -j 4 when invoking the CUDAExtension setup) sped things up significantly, as did specifying only a single arch to build (6.1 in my case). And indeed, ninja doesn’t rebuild everything at every invocation (it only seemed to when I wrote this post).

Still, it’s a different flow (now requiring an install step as well), and some clearer tutorials would be good at some point: not only for adding a major differentiable building block, but also for some small, trivial op.
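For a faster development loop, there is also a JIT path that skips the install step entirely: torch.utils.cpp_extension.load() compiles the sources at import time and caches the build (under ~/.cache/torch_extensions by default), so unchanged files aren’t recompiled on later runs. A sketch, assuming the same source files as in the build script above:

```python
# JIT-compile the extension instead of running "setup.py install";
# the build is cached, so only changed sources are recompiled.
from torch.utils.cpp_extension import load

my_render = load(
    name='my_render',
    sources=['my_render.cpp', 'my_render_gaussian_cuda.cu'],
    verbose=True,  # show the underlying ninja build output
)
# my_render is now a Python module exposing the functions
# bound in my_render.cpp.
```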

Making this customization process as smooth as possible matters, because torch can’t be the best at everything, and being able to drop small snippets of custom CUDA into the bottlenecks can increase performance a lot.

Torch proper is going in a very flexible, general, backend-agnostic direction, and I’m just glad it’s still possible to add an inflexible, hardcoded piece of CUDA that doesn’t have to cater to every context :slight_smile: