Speeding up C++/CUDA extension build time

Hi! Yes, installing ninja, setting MAX_JOBS=4, and invoking the CUDAExtension build with -j 4 significantly sped things up, as did restricting the build to a single compute architecture (6.1, for example, in my case). And indeed, ninja doesn’t rebuild everything on every invocation (it seemed to when I wrote this post).
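For reference, the combination described above looks roughly like this as a build recipe (the arch value 6.1 is Pascal; adjust it for your GPU, and note the exact -j handling may vary with your setuptools version):

```shell
# Make ninja available so torch.utils.cpp_extension picks it up as the builder
pip install ninja

# Cap build parallelism used by the extension build
export MAX_JOBS=4

# Compile only for a single compute capability instead of all supported archs
export TORCH_CUDA_ARCH_LIST="6.1"

# Build with -j parallel jobs, then install the extension
python setup.py build_ext -j 4 install
```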

Still, it’s a different flow (it now requires the install step as well), and some clearer tutorials might be good at some point: not only for adding a major differentiable building block, but also for some small, trivial op.

Making this customization process as smooth as possible matters, because torch can’t be best at everything; being able to add small snippets of custom CUDA here and there in the bottlenecks can increase performance a lot.

Torch proper is going in a very flexible, general, backend-agnostic direction, and I’m just glad it’s still possible to add an inflexible, hardcoded piece of CUDA that doesn’t have to cater to every context :slight_smile: