Hi experts:
I enabled the channels-last memory format following the tutorial "(beta) Channels Last Memory Format in PyTorch" (PyTorch Tutorials 2.2.0+cu121 documentation) and turned on AMP at the same time. When I profiled the model with Nsight Systems, I found a lot of unexpected genericTranspose_kernel launches when the input shape is [3, 512, 960].
The genericTranspose_kernel launches appear around the cutlass kernels in the timeline.
To find out why these kernels show up, I changed the input shape. With an input shape of [32, 128, 240], the genericTranspose_kernel is gone.
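A minimal sketch of this kind of setup, for anyone who wants to try reproducing it. It is not the actual model from the profile: a single nn.Conv2d stands in for the network, the run_once helper and the batch size of 1 are assumptions, and the shapes are read as [C, H, W]. When CUDA is not available it falls back to CPU with bfloat16 so the script still runs, although the transpose kernels would only be visible in a GPU profile.

```python
import torch
import torch.nn as nn


def run_once(channels, height, width, batch=1):
    """Run one conv forward pass with channels_last + autocast (hypothetical helper)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # float16 is the usual AMP dtype on GPU; bfloat16 is used for the CPU fallback.
    amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

    # Stand-in model (assumption): one 3x3 convolution with 64 output channels.
    model = nn.Conv2d(channels, 64, kernel_size=3, padding=1).to(device)
    model = model.to(memory_format=torch.channels_last)

    # Convert the 4D NCHW input to the channels_last (NHWC) memory layout.
    x = torch.randn(batch, channels, height, width, device=device)
    x = x.to(memory_format=torch.channels_last)

    with torch.no_grad(), torch.autocast(device_type=device, dtype=amp_dtype):
        y = model(x)
    return x, y


if __name__ == "__main__":
    # The two input shapes from the post, interpreted as [C, H, W].
    for c, h, w in [(3, 512, 960), (32, 128, 240)]:
        x, y = run_once(c, h, w)
        print((c, h, w), "input strides:", x.stride(), "output shape:", tuple(y.shape))
```

Running this script under Nsight Systems on a GPU and comparing the kernel lists for the two shapes is one way to check whether the extra kernels depend on the number of input channels.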
Any insights on that? Thank you!

