When PyTorch executes ResNet-50 on the GPU, it launches the kernel `ampere_sgemm_128x32_sliced1x4_nn`. Where can I find the CUDA source code of this kernel in PyTorch?

I can see this kernel in NVIDIA Nsight Systems:
| Time (%) | Total Time | Instances | Avg | Med | Min | Max | StdDev | GridXYZ | BlockXYZ | Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 5.40% | 110.754 μs | 7 | 15.822 μs | 14.688 μs | 14.561 μs | 22.561 μs | 2.972 μs | 7 32 1 | 128 1 1 | ampere_sgemm_32x32_sliced1x4_nn |
| 4.80% | 98.178 μs | 42 | 2.337 μs | 2.336 μs | 2.240 μs | 2.720 μs | 70 ns | 1 1 1 | 128 1 1 | void at::native::vectorized_elementwise_kernel<(int)4, at::native::::batch_norm_calc_invstd(const at::Tensor &, const at::Tensor &, double)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array<char *, (int)2>>(int, T2, T3) |
| 4.40% | 90.498 μs | 3 | 30.166 μs | 29.889 μs | 29.888 μs | 30.721 μs | 480 ns | 1 16 10 | 256 1 1 | ampere_sgemm_64x32_sliced1x4_nn |
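For context, a kernel summary like the one above can be produced with Nsight Systems from the command line. This is a minimal sketch; the script name `resnet50_infer.py` and the output name `resnet50` are placeholders, not part of the original post:

```shell
# Profile the PyTorch run and record CUDA kernel activity
# (assumes nsys is on PATH and resnet50_infer.py is your own script).
nsys profile --trace=cuda,nvtx -o resnet50 python resnet50_infer.py

# Summarize time per CUDA kernel from the recorded report;
# this report lists Time (%), Total Time, Instances, Avg/Med/Min/Max,
# StdDev, grid/block dimensions, and the kernel name.
nsys stats --report cuda_gpu_kern_sum resnet50.nsys-rep
```

Note that kernels named like `ampere_sgemm_*_sliced1x4_nn` are not compiled from PyTorch's own CUDA sources: they come from NVIDIA's cuBLAS library, which PyTorch calls for GEMMs and which ships only as a closed-source binary, so there is no `.cu` file for them in the PyTorch repository.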