The generated Triton MaxPool2d kernel performs poorly on some platforms

I benchmarked the MaxPool2d forward kernel on several platforms. The new Triton kernel is faster on some of them, mainly because it uses the L1 cache efficiently. On platforms with a different L1 cache architecture, however, it is significantly slower than the native kernel. MaxPool2d is treated as a pointwise op in torch, yet its data loading performance changes greatly as the stride parameter changes. Is this something that should be taken into account?
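To make the stride concern concrete, here is a rough, pure-Python sketch of the access pattern. It is not the generated kernel. It assumes that consecutive lanes handle consecutive output columns, that there is no padding or dilation, and that elements are 4 bytes with a 128-byte cache line. The function names and these sizes are illustrative assumptions, not values taken from any specific platform.

```python
def cache_lines_touched(stride, lanes=32, tap=0, elem_bytes=4, line_bytes=128):
    """Count distinct cache lines hit when `lanes` adjacent lanes each load
    one window tap. Lane i reads input column i * stride + tap (assumed mapping)."""
    addresses = [(lane * stride + tap) * elem_bytes for lane in range(lanes)]
    return len({addr // line_bytes for addr in addresses})


def distinct_inputs(kernel, stride, outputs=32):
    """Count distinct input columns read by `outputs` adjacent pooling windows.
    Overlapping windows (kernel > stride) re-read columns, so the total number
    of loads (outputs * kernel) exceeds this count and the reuse depends on cache."""
    columns = {ow * stride + kw for ow in range(outputs) for kw in range(kernel)}
    return len(columns)


if __name__ == "__main__":
    for stride in (1, 2, 3, 4):
        lines = cache_lines_touched(stride)
        distinct = distinct_inputs(kernel=3, stride=stride)
        print(f"stride={stride}: cache lines per tap={lines}, "
              f"loads={32 * 3}, distinct inputs={distinct}")
```

Under these assumptions, a larger stride spreads one group of loads over more cache lines, and whenever the kernel size exceeds the stride the same inputs are loaded more than once, so the result depends on how well the platform's L1 cache serves those repeated loads.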

Could you create an issue on GitHub describing your profiling in detail and sharing the code you’ve used to profile these kernels?

OK, thank you. Here is the issue: https://github.com/pytorch/pytorch/issues/107441