I’m comparing these two codepaths:
torch.compile(my_function, options={"max_autotune": True})
torch.compile(my_function)
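For context, this is roughly how I'm timing the two variants (a sketch, not my exact harness: `my_function` and its inputs stand in for my real workload, and the warmup loop is there so compilation and autotuning cost is excluded from the measurement):

```python
import time
import statistics

def benchmark(fn, *args, warmup=10, iters=100):
    """Return the median wall-clock time of fn(*args) over `iters` runs."""
    # Warm up: the first calls pay for compilation/autotuning.
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(iters):
        # Assumption: CPU-side timing. On a CUDA backend you would call
        # torch.cuda.synchronize() before reading the clock on both sides,
        # since kernel launches are asynchronous.
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)
```

I compare `benchmark(compiled_default, x)` against `benchmark(compiled_autotuned, x)`, and the autotuned one comes out ahead.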
And the max_autotune version is faster. But in the logs, even the best Triton kernels only reach about 50% of the baseline mm's speed, like this:
SingleProcess AUTOTUNE takes 3.5235 seconds
AUTOTUNE mm(16384x12288, 12288x4096)
mm 2.3869 ms 100.0%
triton_mm_661 4.4904 ms 53.2%
triton_mm_662 4.6002 ms 51.9%
triton_mm_667 4.6453 ms 51.4%
triton_mm_664 4.7055 ms 50.7%
triton_mm_663 5.0753 ms 47.0%
triton_mm_660 5.9063 ms 40.4%
triton_mm_668 6.2341 ms 38.3%
triton_mm_670 7.5920 ms 31.4%
triton_mm_669 12.1617 ms 19.6%
As I understand it, since the baseline mm (at 100%) beats every Triton candidate, autotuning should just fall back to that same kernel the default path already uses, so max_autotune should change nothing. Yet it's consistently about 1% faster. Why might this be? Is autotune doing something beyond picking the fastest matmul kernel?