Matmul slow for small matrices

I’m running a benchmark on small matrices (3x3, 4x4), comparing against numpy, and it turns out that we are quite slow. The hardware is a Razer Lambda.
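For reference, a comparison of this kind can be sketched roughly as below. This is not the linked kornia-benchmark code. The helper names `bench` and `compare` are made up for illustration, and the sketch assumes CPU tensors, float32, and eager mode.

```python
import timeit

import numpy as np
import torch


def bench(fn, number=10_000):
    """Return the mean wall-clock time per call in microseconds."""
    fn()  # warm-up call so one-time setup cost is not measured
    return timeit.timeit(fn, number=number) / number * 1e6


def compare(n, number=10_000):
    """Time an n x n float32 matmul in NumPy and in eager PyTorch on CPU."""
    a_np = np.random.rand(n, n).astype(np.float32)
    b_np = np.random.rand(n, n).astype(np.float32)
    # from_numpy shares memory with the arrays, so both sides see the same data.
    a_t = torch.from_numpy(a_np)
    b_t = torch.from_numpy(b_np)
    # Sanity check: both libraries compute the same product.
    assert np.allclose(a_np @ b_np, (a_t @ b_t).numpy(), atol=1e-5)
    return {
        "numpy_us": bench(lambda: a_np @ b_np, number),
        "torch_us": bench(lambda: a_t @ b_t, number),
    }


if __name__ == "__main__":
    for n in (3, 4):
        print(n, compare(n))
```

At these sizes the arithmetic itself is tiny, so the numbers mostly reflect per-call overhead in each library rather than the speed of the matmul kernel.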


to reproduce: kornia-benchmark/ at master · kornia/kornia-benchmark · GitHub

pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url

I’m wondering whether we can improve performance for small matrices to target vision/robotics applications.

I’d suggest you try out torch.compile(mod, mode="reduce-overhead") for anything on the smaller end. That said, I’m not sure a matmul is the most meaningful benchmark, since Inductor’s benefits mostly come from fusions.
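A minimal sketch of that suggestion, assuming a plain function stands in for `mod` (the name `matmul` here is just illustrative) and a PyTorch build where `torch.compile` is available:

```python
import torch


def matmul(a, b):
    """Plain matmul; stands in for the module or function being compiled."""
    return a @ b


# torch.compile wraps the function lazily; the actual compilation
# happens on the first call with real inputs.
compiled_matmul = torch.compile(matmul, mode="reduce-overhead")

if __name__ == "__main__":
    a = torch.rand(4, 4)
    b = torch.rand(4, 4)
    with torch.no_grad():
        out = compiled_matmul(a, b)  # first call triggers compilation
    # Compare against the eager result as a correctness check.
    print(torch.allclose(out, matmul(a, b), atol=1e-5))
```

The first call should be excluded from any timing, since it includes compilation. The "reduce-overhead" mode relies on CUDA graphs on GPU, so how much it helps on CPU inputs like the ones above is something to measure rather than assume.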

I’ll try it. I’m just trying to find the best way to use torch for small-matrix operations.