Compiled matmul is slower than vanilla matmul

Hi, I wonder why, in my test, compiled torch.matmul is much slower than vanilla torch.matmul (>25% difference)? I understand torch.matmul uses cuBLAS, while torch.compile lowers matmul to Triton via Inductor. But I would expect the vanilla cuBLAS kernel to be one of the autotune candidates, and hence the compiled matmul to be at least as fast as the vanilla one. Is my understanding correct?

    time_taken = timeit(
        torch.matmul, A, B, out=C
    )

vs

    def matmul_wrapper(A, B, out):
        return torch.matmul(A, B, out=out)

    compiled_matmul = torch.compile(matmul_wrapper, fullgraph=True, mode="max-autotune-no-cudagraphs")
    # warmup
    compiled_matmul(A, B, out=C)
    time_taken = timeit(
        compiled_matmul, A, B, out=C
    )
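As an aside, GPU timings like this are sensitive to how you synchronize; a minimal sketch of a timing helper (the `bench` name is hypothetical, not from the snippet above) that warms up first and synchronizes before reading the clock:

```python
import time
import torch

def bench(fn, iters=100):
    """Average wall-clock time of fn(), synchronizing the GPU so
    queued kernels are included in the measurement."""
    # warmup (also triggers compilation for torch.compile'd functions)
    for _ in range(3):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# usage sketch: compare eager vs compiled on the same inputs
# device = "cuda" if torch.cuda.is_available() else "cpu"
# A = torch.randn(4096, 4096, device=device)
# B = torch.randn(4096, 4096, device=device)
# C = torch.empty(4096, 4096, device=device)
# eager_t = bench(lambda: torch.matmul(A, B, out=C))
```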

Env: torch 2.7, CUDA 12.8, B200


I don’t think the default ATen backend (which calls into cuBLAS) will be used unless no other choices are available and ATen is taken as the fallback, as seen here and here.

CC @yf225 to correct me