How to use Inductor to generate Triton code on GPU?

I built PyTorch from source at commit f9a250c35bd061e2e6f4c2d92e2b1b16390e8636.

I want to use Inductor to generate Triton code on the GPU. How do I do that?
When I write softmax, matmul, or bmm kernels and compile them with the `TORCH_COMPILE_DEBUG=1` environment variable set, the generated code in the `torch_compile_debug` folder falls back to the `aten::` implementations for all of them.
How can I make a kernel use the Triton template directly? Is there a flag I should set?
Thanks a lot!