Hi, I wonder why, in my test, compiled torch.matmul is much slower than vanilla torch.matmul (>25% difference)? My understanding is that torch.matmul calls cuBLAS directly, while torch.compile lowers matmul through Inductor to Triton. But I would expect the vanilla cuBLAS kernel to be one of the autotune candidates, and hence the compiled matmul should be at least as fast as the vanilla one. Is my understanding correct?
time_taken = timeit(  # timeit here is my own benchmarking helper, not stdlib timeit.timeit
    torch.matmul, A, B, out=C
)
vs
def matmul_wrapper(A, B, out):
    return torch.matmul(A, B, out=out)

compiled_matmul = torch.compile(matmul_wrapper, fullgraph=True, mode="max-autotune-no-cudagraphs")

# warmup: triggers compilation + autotuning
compiled_matmul(A, B, out=C)

time_taken = timeit(
    compiled_matmul, A, B, out=C
)
env: torch 2.7, CUDA 12.8, B200