When PyTorch executes ResNet-50 on the GPU, it launches the kernel `ampere_sgemm_128x32_sliced1x4_nn`. Where can I find the CUDA source code of this kernel in PyTorch?

I can see this kernel in NVIDIA Nsight Systems:
| Time (%) | Total Time | Instances | Avg | Med | Min | Max | StdDev | GridXYZ | BlockXYZ | Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 5.40% | 110.754 μs | 7 | 15.822 μs | 14.688 μs | 14.561 μs | 22.561 μs | 2.972 μs | 7 32 1 | 128 1 1 | ampere_sgemm_32x32_sliced1x4_nn |
| 4.80% | 98.178 μs | 42 | 2.337 μs | 2.336 μs | 2.240 μs | 2.720 μs | 70 ns | 1 1 1 | 128 1 1 | void at::native::vectorized_elementwise_kernel<(int)4, at::native::::batch_norm_calc_invstd(const at::Tensor &, const at::Tensor &, double)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array<char *, (int)2>>(int, T2, T3) |
| 4.40% | 90.498 μs | 3 | 30.166 μs | 29.889 μs | 29.888 μs | 30.721 μs | 480 ns | 1 16 10 | 256 1 1 | ampere_sgemm_64x32_sliced1x4_nn |
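For context, a kernel summary like the one above can be produced with Nsight Systems from the command line. This is a minimal sketch; the script name `resnet50_infer.py` and the output name `resnet50` are placeholders, not part of the original post:

```shell
# Profile the PyTorch run and record CUDA kernel activity
# (assumes nsys is on PATH and resnet50_infer.py is your own script).
nsys profile --trace=cuda,nvtx -o resnet50 python resnet50_infer.py

# Summarize time per CUDA kernel from the recorded report;
# this report lists Time (%), Total Time, Instances, Avg/Med/Min/Max,
# StdDev, grid/block dimensions, and the kernel name.
nsys stats --report cuda_gpu_kern_sum resnet50.nsys-rep
```

Note that kernels named like `ampere_sgemm_*_sliced1x4_nn` are not compiled from PyTorch's own CUDA sources: they come from NVIDIA's cuBLAS library, which PyTorch calls for GEMMs and which ships only as a closed-source binary, so there is no `.cu` file for them in the PyTorch repository.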