I’m trying to understand how TorchInductor schedules generated Triton kernels for execution. I can see that in the precompile function of CachingAutotuner, kernel binaries and launchers are populated, but I’m not sure where these launchers are actually called and how the corresponding cudaLaunchKernel calls are issued.
Could someone please point me in the right direction?
Thanks in advance!
From what I was able to find, this is handled by Triton during launcher generation for each kernel, in def generate_launcher(constants, signature, ids).
It compiles the CUDA launcher code into a shared library and commits it to the codecache; the resulting callable is what later issues the actual kernel launch.
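To make the overall flow concrete, here is a minimal pure-Python sketch of the pattern described above: precompile-style setup populates a list of launcher callables, autotuning picks the fastest, and run dispatches through it. This is illustrative only, not the actual PyTorch/Triton source; the class and names (FakeLauncher, CachingAutotunerSketch, the cost field) are hypothetical stand-ins, and in the real system the launcher callable is built from the compiled shared library and ultimately issues cudaLaunchKernel.

```python
class FakeLauncher:
    """Hypothetical stand-in for the callable built from the compiled
    launcher .so; in the real flow, invoking it issues cudaLaunchKernel."""

    def __init__(self, name, cost):
        self.name = name
        self.cost = cost  # pretend benchmark time for this config

    def __call__(self, *args, grid=None):
        # Real launchers would pass args and grid down to the driver API.
        return f"{self.name} launched with grid={grid}"


class CachingAutotunerSketch:
    """Sketch of the dispatch pattern: launchers are populated up front
    (as precompile does), then the best one is selected and called."""

    def __init__(self, launchers):
        self.launchers = launchers  # populated during precompilation
        self.best = None

    def autotune(self):
        # Benchmark each candidate once and keep the fastest (simulated
        # here by the precomputed cost field).
        self.best = min(self.launchers, key=lambda l: l.cost)

    def run(self, *args, grid=None):
        if self.best is None:
            self.autotune()
        # In the real system, this call is where the shared-library
        # launcher would issue the actual kernel launch.
        return self.best(*args, grid=grid)


tuner = CachingAutotunerSketch(
    [FakeLauncher("cfg_a", 2.0), FakeLauncher("cfg_b", 1.0)]
)
print(tuner.run(grid=(64, 1, 1)))
```

The point of the sketch is the separation of concerns: compilation and launcher construction happen once and are cached, while run only does cheap dispatch through the already-selected callable.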