Need help tracing the internal CPU matmul code

Hello, I am attempting to trace the sequence of parallel multiplication and addition operations inside the matrix multiplication function, torch.matmul. Starting from the C++ code in the PyTorch GitHub repository, I've tracked the actual execution down to a call to at::cpu::mm_out(out, mat1, mat2). This function appears to be generated by ATen's code generator when PyTorch is built.
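
For anyone who wants to reproduce my trace, a minimal libtorch program that reaches the mm_out call looks roughly like this (a sketch assuming a standard libtorch setup; torch::mm routes to the out-variant, so calling torch::mm_out directly gives a convenient breakpoint target):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor a = torch::rand({2, 3});
  torch::Tensor b = torch::rand({3, 4});
  torch::Tensor out = torch::empty({2, 4});

  // On CPU tensors, the dispatcher should route this through the
  // generated at::cpu::mm_out wrapper I describe above.
  torch::mm_out(out, a, b);

  std::cout << out << std::endl;
  return 0;
}
```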

After building PyTorch, I found that the generated mm_out in turn invokes structured_mm_out_cpu::impl(mat1, mat2, out). This method is declared in torch\include\ATen\ops\mm_native.h, which might also be a generated file. However, I have not been able to locate any implementation of this method in the built sources, even after searching for possible inheritance scenarios and taking polymorphism into consideration.
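
My current guess, which I have not been able to confirm, is that the implementation is defined through a name-pasting macro (something like ATen's TORCH_IMPL_FUNC), so the literal string structured_mm_out_cpu::impl never appears anywhere in the source tree. A self-contained sketch of that pattern (the macro and class here are stand-ins of my own, not the real ATen code):

```cpp
#include <iostream>

// Stand-in for the generated class declared in mm_native.h.
struct structured_mm_out_cpu {
  void impl(int mat1, int mat2, int& out);
};

// Stand-in for a TORCH_IMPL_FUNC-style macro: token pasting assembles the
// method definition, so the fully qualified name is never written out and
// a plain text search for "structured_mm_out_cpu::impl" finds nothing.
#define MY_IMPL_FUNC(name) void structured_##name::impl

MY_IMPL_FUNC(mm_out_cpu)(int mat1, int mat2, int& out) {
  out = mat1 * mat2;  // placeholder for the real kernel body
}

int main() {
  structured_mm_out_cpu op;
  int out = 0;
  op.impl(2, 3, out);
  std::cout << out << std::endl;  // prints 6
}
```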

Could you help me track this down, or point me to where the code that performs the parallel multiplication and addition over the tensor elements actually lives?
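
To make the question concrete, the naive single-threaded loop below is the kind of multiply-accumulate code I am trying to locate. My assumption is that the optimized, parallel equivalent lives in whatever BLAS backend PyTorch was built against (e.g. MKL or OpenBLAS) rather than in ATen itself, but I would like to confirm that:

```cpp
#include <cstddef>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), row-major; a naive reference with
// no blocking, vectorization, or threading.
void naive_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t m, std::size_t k,
                std::size_t n) {
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += A[i * k + p] * B[p * n + j];  // the multiply-add in question
      }
      C[i * n + j] = acc;
    }
  }
}

int main() {
  std::vector<float> A{1, 2, 3, 4};  // 2x2
  std::vector<float> B{5, 6, 7, 8};  // 2x2
  std::vector<float> C(4, 0.0f);
  naive_gemm(A, B, C, 2, 2, 2);      // C = A * B
  return 0;
}
```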