Ampere sgemm 128x32 sliced1x4 nn

When PyTorch runs resnet50 on the GPU, it executes the kernel ampere_sgemm_128x32_sliced1x4_nn. The CUDA source code of this operator is located in PyTorch.

I can see this kernel in NVIDIA Nsight:

Time   Total Time  Instances  Avg        Med        Min        Max        StdDev    GridXYZ  BlockXYZ  Name
5.40%  110.754 μs  7          15.822 μs  14.688 μs  14.561 μs  22.561 μs  2.972 μs  7 32 1   128 1 1   ampere_sgemm_32x32_sliced1x4_nn
4.80%  98.178 μs   42         2.337 μs   2.336 μs   2.240 μs   2.720 μs   70 ns     1 1 1    128 1 1   void at::native::vectorized_elementwise_kernel<(int)4, at::native::::batch_norm_calc_invstd(const at::Tensor &, const at::Tensor &, double)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array<char *, (int)2>>(int, T2, T3)
4.40%  90.498 μs   3          30.166 μs  29.889 μs  29.888 μs  30.721 μs  480 ns    1 16 10  256 1 1   ampere_sgemm_64x32_sliced1x4_nn
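
The sgemm kernel names above all follow the same pattern, so here is a minimal sketch that splits such a name into its fields. The helper parse_gemm_kernel_name and the field labels (arch, op, tile_m, tile_n, slice_a, slice_b, layout) are hypothetical names chosen for illustration. Reading the numbers as a tile size and a slicing factor, and "nn" as the transpose flags of the two inputs, is an assumption based on common GEMM naming conventions, not something the profiler output confirms.

```python
import re

# Hypothetical pattern for names such as "ampere_sgemm_128x32_sliced1x4_nn".
# The meaning attached to each group is an assumption, not documented behaviour.
_PATTERN = re.compile(
    r"^(?P<arch>[a-z]+)_(?P<op>[a-z]gemm)"
    r"_(?P<tile_m>\d+)x(?P<tile_n>\d+)"
    r"_sliced(?P<slice_a>\d+)x(?P<slice_b>\d+)"
    r"_(?P<layout>[nt]{2})$"
)


def parse_gemm_kernel_name(name):
    """Return the fields of a GEMM-style kernel name, or None if it does not match."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("tile_m", "tile_n", "slice_a", "slice_b"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    names = [
        "ampere_sgemm_128x32_sliced1x4_nn",
        "ampere_sgemm_32x32_sliced1x4_nn",
        "ampere_sgemm_64x32_sliced1x4_nn",
    ]
    for name in names:
        print(name, "->", parse_gemm_kernel_name(name))
```

Names that do not follow this pattern, such as the vectorized_elementwise_kernel entry, simply return None.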

Could you explain what the question is?