Triton can and does communicate with PyTorch for PTX/cubin codegen. Furthermore, I see PyTorch implements a lightweight version of Triton’s CachingAutotuner class, though I’m still unclear on which of the two (Triton or PyTorch) actually handles kernel launching at runtime. I asked about that in a different post here.
AFAIK, the autotuning apparatus is used regardless of whether you’re tuning over multiple configs or not. In the single-config (i.e., no autotune) case, it simply compiles and launches that one kernel; the multiple-config case is handled separately. See `def run` here.
```python
if len(self.launchers) != 1:
    if len(self.launchers) == 0:
        self.precompile()
    if len(self.launchers) > 1:
        self.autotune_to_one_config(*args, grid=grid, **kwargs)
```
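To make the control flow concrete, here is a minimal toy sketch of that caching-autotuner pattern: compile lazily, benchmark only when there are multiple candidate launchers, then always dispatch to the single survivor. All names here are hypothetical stand-ins (plain Python callables instead of compiled Triton kernels), not Inductor’s actual API.

```python
import time

class MiniAutotuner:
    """Toy sketch of the CachingAutotuner pattern (hypothetical names):
    compile lazily, benchmark when there are multiple candidates,
    then always launch the single surviving config."""

    def __init__(self, configs):
        self.configs = configs   # candidate "kernels" (plain callables here)
        self.launchers = []      # populated lazily by precompile()

    def precompile(self):
        # Stand-in for JIT-compiling each config into a launcher.
        self.launchers = list(self.configs)

    def autotune_to_one_config(self, *args, **kwargs):
        # Time each launcher once and keep only the fastest.
        def bench(fn):
            t0 = time.perf_counter()
            fn(*args, **kwargs)
            return time.perf_counter() - t0
        self.launchers = [min(self.launchers, key=bench)]

    def run(self, *args, **kwargs):
        # Mirrors the structure of the run() snippet above.
        if len(self.launchers) != 1:
            if len(self.launchers) == 0:
                self.precompile()
            if len(self.launchers) > 1:
                self.autotune_to_one_config(*args, **kwargs)
        return self.launchers[0](*args, **kwargs)
```

After the first call with multiple configs, `self.launchers` collapses to one entry, so every later `run()` skips tuning and launches directly; with a single config, no benchmarking ever happens.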
However, at this point I’m not certain whether TorchInductor actually reuses Triton’s JIT runtime or has its own mechanism for launching kernels.