I’m using an AMD Ryzen 9, and I’m trying to compile a simple CUDA extension that just adds 1 to a tensor. However, my compilation time is 41 seconds, which seems really high.
I'm not sure which part is slow here (compiler, linker, etc.). If it's the linker, I wonder whether I could use something like the Mold linker to speed up the build?
After further investigation, I extracted the ninja file and ran it manually, and it takes ~38 seconds to execute. AFAIK there's no linking going on in this file, so the issue really is just slow compilation.
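One way to see where the time goes is to re-run the build with ninja's verbose output and time it. This is a diagnostic sketch, not a definitive recipe: the build directory path is hypothetical (adjust it to wherever your extension's `build.ninja` lives), and the `grep` count includes link and phony edges as well as compiles.

```shell
# Run from the extension's build directory,
# e.g. ~/.cache/torch_extensions/<name>/ (path is an assumption).
ninja -t clean

# -v prints each command as it runs, so you can see
# which compile invocation dominates the wall-clock time.
time ninja -v

# Rough count of build edges in the ninja file
# (includes non-compile edges, so treat it as an upper bound).
grep -c "^build " build.ninja
```

If there's only one or two compile commands and each takes tens of seconds, the bottleneck is per-file compilation rather than the number of files.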
It's been a while since I fiddled with CUDA extensions (per your linked question of mine...), but IIRC the framework pulled in more or less every header file on the system for each file it compiled. Apart from selecting a single arch to build for, getting ninja working, and setting MAX_JOBS correctly (which it seems you've done), my only suggestion is to check that your SSD is fast and that you have plenty of RAM to cache all those files.
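For the arch and parallelism settings mentioned above, `torch.utils.cpp_extension` honors the `TORCH_CUDA_ARCH_LIST` and `MAX_JOBS` environment variables. A minimal config sketch (the specific values `8.6` and `16` are placeholders for your actual GPU arch and core count):

```shell
# Build only for the GPU arch you actually have
# (e.g. 8.6 for an RTX 30xx card); otherwise nvcc may
# compile the kernel separately for several archs.
export TORCH_CUDA_ARCH_LIST="8.6"

# Upper bound on parallel ninja jobs; match your core count.
export MAX_JOBS=16
```

Set these before the Python process that triggers the build, since they're read at compile time.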
I was disappointed because the same code took just a few seconds to compile with the previous PyTorch version, and then it suddenly became ~100x slower.
Another option I didn't look into would be to use ccache in some way, though that won't help if it's nvcc itself that's slow, I guess. Can you check whether you're compiling a huge number of files, or just a few files that each take ages?
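If you do try ccache, one common approach is its "masquerading" mode: symlinks named after the compilers, placed early on `PATH`, so the build tool invokes ccache transparently. This is a sketch under assumptions: it assumes ccache is installed and that your ccache version supports nvcc, and it only speeds up *repeat* builds (the first compile still pays full price). The `~/ccache-bin` directory name is arbitrary.

```shell
# Create symlinks so that invoking "nvcc" or "g++" actually
# runs ccache, which then calls the real compiler found
# later on PATH.
mkdir -p ~/ccache-bin
ln -sf "$(command -v ccache)" ~/ccache-bin/nvcc
ln -sf "$(command -v ccache)" ~/ccache-bin/g++
export PATH="$HOME/ccache-bin:$PATH"

# After a second (re)build, check the hit rate:
ccache -s
```

Note that changing `TORCH_CUDA_ARCH_LIST` or compiler flags invalidates cached results, so the stats only improve once the configuration is stable.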