I’m looking at the debug traces produced by TORCH_COMPILE_DEBUG=1, and I can see that while most kernels are generated by Inductor’s Triton/C++ code generator, some kernels (especially mm, addmm, etc.) are offloaded to external libraries/templates instead. See the example below.
from torch._inductor.select_algorithm import extern_kernels
...
...
extern_kernels.mm(arg0_1, arg1_1, out=buf0)
...
...
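For reference, here is a minimal toy example (not my real model, just an illustrative sketch) that reproduces the same pattern for me when run with TORCH_COMPILE_DEBUG=1:

import torch

def f(a, b):
    return torch.mm(a, b)

compiled = torch.compile(f)
a = torch.randn(64, 64, device="cuda")
b = torch.randn(64, 64, device="cuda")
compiled(a, b)  # output_code.py in the debug trace calls extern_kernels.mm(...)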
I’m wondering if I can make the Triton/C++ code generator emit code for 100% of my operators, without using extern_kernels at all.
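For what it’s worth, the closest knobs I’ve found so far are the max-autotune GEMM options in torch._inductor.config. I’m only guessing (please correct me if this is wrong) that restricting the candidate backends to Triton is the way to do this, roughly like:

import torch
import torch._inductor.config as inductor_config

# My assumption: enable autotuned GEMM lowering and drop ATen from the
# candidate backends, so mm/addmm get Triton template kernels instead of
# extern_kernels calls.
inductor_config.max_autotune_gemm = True
inductor_config.max_autotune_gemm_backends = "TRITON"

compiled = torch.compile(f, mode="max-autotune")  # f as in the toy example above

Is that the intended way, or is there a more direct switch to avoid extern_kernels entirely?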
Thanks!