Getting Triton to generate all kernels

I’m looking at the debug traces produced by TORCH_COMPILE_DEBUG=1, and I can see that while most of the kernels are generated by Inductor’s Triton/C++ code generator, some kernels (especially mm, addmm, etc.) are offloaded to external libraries/templates. See the example below.

from torch._inductor.select_algorithm import extern_kernels
...
...
extern_kernels.mm(arg0_1, arg1_1, out=buf0)
...
...

I’m wondering if I can make the Triton/C++ code generator emit code for 100% of my operators, without falling back to extern_kernels.

Thanks!

There’s no global flag for this as far as I know, but poking around in https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py will be your best bet.


Thanks for pointing me in the right direction; this definitely seems to be the way. From what I’ve seen, GEMM kernels are the ones predominantly handled by extern_kernels.

If I remove ATEN from config.max_autotune_gemm_backends and set TORCHINDUCTOR_MAX_AUTOTUNE=1, then Triton kicks in and generates the GEMM kernels itself.
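For reference, the same knobs can be set through environment variables, since Inductor reads TORCHINDUCTOR_-prefixed overrides for many options in config.py. A minimal sketch, assuming these variable names match your PyTorch version:

```shell
# Sketch: environment-variable equivalents of the config.py settings.
# Variable names follow the TORCHINDUCTOR_ + upper-cased option convention
# used in torch/_inductor/config.py; verify against your installed version.
export TORCHINDUCTOR_MAX_AUTOTUNE=1
export TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON
# python your_script.py   # hypothetical entry point run with these settings
```

This avoids editing the model code, which can be convenient when experimenting across runs.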
