Speedup with max_autotune even though all the triton mm kernels are slower

ad8e · April 2, 2024, 10:48pm

I’m comparing these two codepaths:
torch.compile(my_function, options={"max_autotune": True})

torch.compile(my_function)

And the max_autotune version is faster. But in the logs, every Triton mm candidate is slower than the baseline mm, reaching at best about 53% of its speed. The autotune output looks like this:

SingleProcess AUTOTUNE takes 3.5235 seconds
AUTOTUNE mm(16384x12288, 12288x4096)
  mm 2.3869 ms 100.0%
  triton_mm_661 4.4904 ms 53.2%
  triton_mm_662 4.6002 ms 51.9%
  triton_mm_667 4.6453 ms 51.4%
  triton_mm_664 4.7055 ms 50.7%
  triton_mm_663 5.0753 ms 47.0%
  triton_mm_660 5.9063 ms 40.4%
  triton_mm_668 6.2341 ms 38.3%
  triton_mm_670 7.5920 ms 31.4%
  triton_mm_669 12.1617 ms 19.6%
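
To make clear how I'm reading the percentage column: it looks like the fastest candidate's time divided by each candidate's time. Here is a minimal sketch that recomputes it from a few of the rows above. The `rows` list and the `relative_speed` helper are names I made up for illustration, not part of PyTorch, and the formula is my assumption about how the log is computed.

```python
# Timings (in milliseconds) copied from a few rows of the AUTOTUNE log above.
rows = [
    ("mm", 2.3869),
    ("triton_mm_661", 4.4904),
    ("triton_mm_662", 4.6002),
    ("triton_mm_669", 12.1617),
]


def relative_speed(rows):
    """Return {name: percent}, assuming percent = fastest_time / time * 100."""
    fastest = min(ms for _, ms in rows)
    return {name: round(fastest / ms * 100, 1) for name, ms in rows}


speeds = relative_speed(rows)
# The candidate with the smallest time is the one autotuning should select.
winner = min(rows, key=lambda r: r[1])[0]

print(speeds)
print("fastest candidate:", winner)
```

Under that assumption the recomputed values match the log, and the non-Triton mm is the fastest candidate for this shape.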

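As I understand it, this means autotuning should always pick the baseline mm here, so max_autotune should change nothing. But the max_autotune version is actually 1% faster end to end. Why might this be? Is max_autotune doing something other than choosing between these mm kernels?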