CUDA `add_kernel`

Can anyone point me to where `add_kernel` is implemented for CUDA specifically? Looking at a CUDA trace captured with nsys, I can see that the dispatcher ultimately reaches an elementwise kernel launched by `at::native::gpu_kernel_impl`:

```
libcuda.so.535.113.01!0x7f2c80680076
libcudart.so.12!cudaLaunchKernel
libtorch_cuda.so!void at::native::gpu_kernel_impl<...>(...)
libtorch_cuda.so!void at::native::gpu_kernel<...>(...)
libtorch_cuda.so!at::native::add_kernel(...)::{lambda()#1}::operator()() const
libtorch_cuda.so!at::native::add_kernel(...)
libtorch_cuda.so!at::(...)::wrapper_CUDA_add__Tensor(...)
libtorch_cpu.so!at::_ops::add__Tensor::call(...)
libtorch_cpu.so!at::native::_convolution(...)
libtorch_cpu.so!at::(...)::(...)::wrapper_CompositeExplicitAutograd___convolution(...)
libtorch_cpu.so!c10::impl::wrap_kernel_functor_unboxed_<...>::call(...)
libtorch_cpu.so!at::_ops::_convolution::call(...)
libtorch_cpu.so!at::native::convolution(...)
libtorch_cpu.so!at::(...)::(...)::wrapper_CompositeExplicitAutograd__convolution(...)
libtorch_cpu.so!c10::impl::wrap_kernel_functor_unboxed_<...>::call(...)
libtorch_cpu.so!at::_ops::convolution::redispatch(...)
libtorch_cpu.so!torch::autograd::VariableType::(...)::convolution(...)
libtorch_cpu.so!c10::impl::wrap_kernel_functor_unboxed_<...>::call(...)
libtorch_cpu.so!at::_ops::convolution::call(...)
libtorch_cpu.so!at::native::conv2d_symint(...)
libtorch_cpu.so!c10::impl::wrap_kernel_functor_unboxed_<...>::call(...)
libtorch_cpu.so!at::_ops::conv2d::call(...)
libtorch_python.so!torch::autograd::THPVariable_conv2d(...)
```
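For reference, the run that produced this trace was just a plain conv2d call from Python (hence the `THPVariable_conv2d` frame at the bottom). A minimal C++ equivalent that hits the same `at::_ops::conv2d::call` entry point would be something like:

```cpp
// Minimal repro sketch (C++ equivalent of the Python call I profiled).
// Per the trace, at::native::_convolution applies the bias via an
// in-place add_, which is what lands in add_kernel.
#include <ATen/ATen.h>

int main() {
  auto input  = at::randn({1, 3, 32, 32}, at::kCUDA);
  auto weight = at::randn({8, 3, 3, 3},   at::kCUDA);
  auto bias   = at::randn({8},            at::kCUDA);
  auto out = at::conv2d(input, weight, bias);
  return 0;
}
```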

In past versions of PyTorch, the add op's CUDA kernel was implemented in `BinaryAddSubKernel.cu` under `ATen/native/cuda` and registered as `add_kernel_cuda`. However, that file no longer exists in more recent versions of PyTorch (>= 2.1).
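From memory, that file followed the standard TensorIterator pattern. A rough sketch of what it contained (identifiers approximate, not the verbatim source):

```cpp
// Rough sketch from memory of the old BinaryAddSubKernel.cu, not verbatim.
#include <ATen/Dispatch.h>
#include <ATen/native/BinaryOps.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>

namespace at::native {

void add_kernel_cuda(TensorIteratorBase& iter, const Scalar& alpha_scalar) {
  AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, iter.common_dtype(), "add_cuda", [&]() {
    scalar_t alpha = alpha_scalar.to<scalar_t>();
    // gpu_kernel_with_scalars launches the elementwise CUDA kernel that
    // shows up as gpu_kernel_impl in the nsys trace.
    gpu_kernel_with_scalars(iter, [alpha] GPU_LAMBDA (scalar_t a, scalar_t b) -> scalar_t {
      return a + alpha * b;
    });
  });
}

// Hooked into the dispatcher via the DispatchStub mechanism:
REGISTER_DISPATCH(add_stub, &add_kernel_cuda);

} // namespace at::native
```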

The `native_functions.yaml` entry for `add.Tensor` is:

```yaml
- func: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  device_check: NoCheck   # TensorIterator
  structured_delegate: add.out
  variants: function, method
  dispatch:
    SparseCPU, SparseCUDA: add_sparse
    SparseCsrCPU, SparseCsrCUDA: add_sparse_csr
    MkldnnCPU: mkldnn_add
    ZeroTensor: add_zerotensor
    NestedTensorCPU, NestedTensorCUDA: NestedTensor_add_Tensor
  tags: [core, pointwise]
```
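Note that the dispatch table only lists sparse / MKL-DNN / zero-tensor / nested-tensor kernels; the dense CPU and CUDA path goes through `structured_delegate: add.out` instead. As I understand the structured-kernel plumbing (sketch, details possibly off), the codegen'd wrapper (the `wrapper_CUDA_add__Tensor` frame in the trace) builds the TensorIterator via the meta function and then calls a shared impl that fires a device-dispatched stub, roughly:

```cpp
// Sketch of the structured-kernel glue (along the lines of
// aten/src/ATen/native/BinaryOps.cpp, possibly paraphrased):
TORCH_IMPL_FUNC(add_out) (
    const Tensor& self, const Tensor& other,
    const Scalar& alpha, const Tensor& result) {
  // *this is the TensorIterator-derived structured class set up by the
  // meta function; add_stub picks the kernel registered for the device.
  add_stub(device_type(), *this, alpha);
}
```

So the question is really: where does the CUDA side of `add_stub` get registered now?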

Any pointers to where the kernel lives now, or thoughts on the rationale behind the refactoring / re-implementation, would be greatly appreciated!

I would assume the TensorIterator / `gpu_kernel` machinery is still used from here; I just can't find where the CUDA inner loop for add is now defined and registered.
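As a sanity check (sketch, untested): profiling a bare in-place add on CUDA tensors under nsys should show the same `gpu_kernel_impl` / `add_kernel` frames without any of the convolution stack:

```cpp
// Isolate add from the conv stack: profile this under nsys and compare
// the kernel-launch backtrace against the one above.
#include <ATen/ATen.h>

int main() {
  auto a = at::ones({1 << 20}, at::kCUDA);
  auto b = at::ones({1 << 20}, at::kCUDA);
  a.add_(b, /*alpha=*/2.0);  // same at::_ops::add__Tensor entry point as in the trace
  return 0;
}
```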