I’m trying to understand how TorchInductor schedules generated Triton kernels for execution. I can see that in the precompile function of CachingAutotuner, kernel binaries and launchers are populated, but I’m not sure where these launchers are actually called and how the corresponding cudaLaunchKernel calls are issued.
Could someone please point me in the right direction?
Thanks in advance!
From what I was able to find, this is handled by Triton during launcher generation for each kernel, in def generate_launcher(constants, signature, ids).
It compiles the CUDA launcher code into a shared library and commits it to the codecache; the resulting callable is what later issues the actual kernel launch.
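To make the overall flow concrete, here is a minimal pure-Python sketch of the pattern described above: precompile-style setup populates a list of launcher callables, autotuning picks the fastest, and run dispatches through it. This is illustrative only, not the actual PyTorch/Triton source; the class and names (FakeLauncher, CachingAutotunerSketch, the cost field) are hypothetical stand-ins, and in the real system the launcher callable is built from the compiled shared library and ultimately issues cudaLaunchKernel.

```python
class FakeLauncher:
    """Hypothetical stand-in for the callable built from the compiled
    launcher .so; in the real flow, invoking it issues cudaLaunchKernel."""

    def __init__(self, name, cost):
        self.name = name
        self.cost = cost  # pretend benchmark time for this config

    def __call__(self, *args, grid=None):
        # Real launchers would pass args and grid down to the driver API.
        return f"{self.name} launched with grid={grid}"


class CachingAutotunerSketch:
    """Sketch of the dispatch pattern: launchers are populated up front
    (as precompile does), then the best one is selected and called."""

    def __init__(self, launchers):
        self.launchers = launchers  # populated during precompilation
        self.best = None

    def autotune(self):
        # Benchmark each candidate once and keep the fastest (simulated
        # here by the precomputed cost field).
        self.best = min(self.launchers, key=lambda l: l.cost)

    def run(self, *args, grid=None):
        if self.best is None:
            self.autotune()
        # In the real system, this call is where the shared-library
        # launcher would issue the actual kernel launch.
        return self.best(*args, grid=grid)


tuner = CachingAutotunerSketch(
    [FakeLauncher("cfg_a", 2.0), FakeLauncher("cfg_b", 1.0)]
)
print(tuner.run(grid=(64, 1, 1)))
```

The point of the sketch is the separation of concerns: compilation and launcher construction happen once and are cached, while run only does cheap dispatch through the already-selected callable.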