I’m looking at the debug traces produced by
TORCH_COMPILE_DEBUG=1, and I can see that while most of the kernels are generated by Inductor’s Triton/C++ code generators, some kernels (especially mm, addmm, etc.) are offloaded to external libraries/templates. See the example below.
```python
from torch._inductor.select_algorithm import extern_kernels
...
extern_kernels.mm(arg0_1, arg1_1, out=buf0)
...
```
I’m wondering if I can make the Triton/C++ code generator emit code for 100% of my operators, without falling back to these external kernels.