Performance Issue: torch.matmul selecting cutlass sm75 kernel for A100

Thanks for your reply.
In this case, I found that the sm75 kernel was selected when the input tensor was not contiguous. When the tensor was contiguous, the sm80 kernel was always chosen.
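For anyone who hits the same issue, here is a minimal sketch of how to check for this and work around it. It assumes a recent PyTorch with `torch.profiler`. `matmul_contiguous` and `cuda_kernel_names` are hypothetical helper names, not PyTorch APIs. PyTorch and cuBLAS pick the kernel internally, and the sm75/sm80 behaviour is only what I observed on the A100, so forcing a contiguous copy is a possible workaround, not a guaranteed fix. It also costs an extra memory copy, which may or may not be cheaper than the slower kernel.

```python
import torch


def matmul_contiguous(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: make both operands contiguous before torch.matmul.

    .contiguous() returns the tensor itself (no copy) if it is already
    contiguous; otherwise it materializes a contiguous copy.
    """
    return torch.matmul(a.contiguous(), b.contiguous())


def cuda_kernel_names(a: torch.Tensor, b: torch.Tensor) -> list[str]:
    """Hypothetical helper: profile one matmul and return the recorded event
    names, which include the launched CUDA kernels. Requires a CUDA device.
    Exact kernel names depend on the GPU and the CUDA/cuBLAS versions."""
    from torch.profiler import ProfilerActivity, profile

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        torch.matmul(a, b)
        torch.cuda.synchronize()
    return [evt.key for evt in prof.key_averages()]


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 256, device=device)
w = torch.randn(512, 1024, device=device)

# x.t() is a transposed view: same storage, swapped strides, so not contiguous.
xt = x.t()
print(xt.is_contiguous())               # False
print(xt.contiguous().is_contiguous())  # True

out_view = torch.matmul(xt, w)          # non-contiguous input
out_copy = matmul_contiguous(xt, w)     # contiguous copy of the input

if device == "cuda":
    # Compare which kernels are launched for the two layouts.
    print("non-contiguous:", cuda_kernel_names(xt, w))
    print("contiguous:    ", cuda_kernel_names(xt.contiguous(), w))
```

On the A100, comparing the two printed kernel lists should show whether the sm75 kernel goes away once the input is contiguous. The numerical results of both calls should match up to floating-point rounding.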