Can anyone point me to where `add_kernel` is implemented specifically for CUDA? Looking at CUDA traces (through `nsys`), I can see the dispatcher ultimately calls out to this kernel, which goes through `gpu_kernel_impl`:
```
libcuda.so.535.113.01!0x7f2c80680076
libcudart.so.12!cudaLaunchKernel
libtorch_cuda.so!void at::native::gpu_kernel_impl<...>(...)
libtorch_cuda.so!void at::native::gpu_kernel<...>(...)
libtorch_cuda.so!at::native::add_kernel(...)::{lambda()#1}::operator()() const
libtorch_cuda.so!at::native::add_kernel(...)
libtorch_cuda.so!at::(...)::wrapper_CUDA_add__Tensor(...)
libtorch_cpu.so!at::_ops::add__Tensor::call(...)
libtorch_cpu.so!at::native::_convolution(...)
libtorch_cpu.so!at::(...)::(...)::wrapper_CompositeExplicitAutograd___convolution(...)
libtorch_cpu.so!c10::impl::wrap_kernel_functor_unboxed_<...>::call(...)
libtorch_cpu.so!at::_ops::_convolution::call(...)
libtorch_cpu.so!at::native::convolution(...)
libtorch_cpu.so!at::(...)::(...)::wrapper_CompositeExplicitAutograd__convolution(...)
libtorch_cpu.so!c10::impl::wrap_kernel_functor_unboxed_<...>::call(...)
libtorch_cpu.so!at::_ops::convolution::redispatch(...)
libtorch_cpu.so!torch::autograd::VariableType::(...)::convolution(...)
libtorch_cpu.so!c10::impl::wrap_kernel_functor_unboxed_<...>::call(...)
libtorch_cpu.so!at::_ops::convolution::call(...)
libtorch_cpu.so!at::native::conv2d_symint(...)
libtorch_cpu.so!c10::impl::wrap_kernel_functor_unboxed_<...>::call(...)
libtorch_cpu.so!at::_ops::conv2d::call(...)
libtorch_python.so!torch::autograd::THPVariable_conv2d(...)
```
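For context, the trace came from profiling a small script roughly like the following (a repro sketch; the exact shapes and arguments are my own, not from the trace). Note how the stack shows `conv2d` redispatching into `add_.Tensor`, which is what pulls in `add_kernel` here:

```python
import torch
import torch.nn.functional as F

# Fall back to CPU when no GPU is present; the add_kernel frames in the
# trace only appear on the CUDA path.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(1, 3, 8, 8, device=device)   # input: NCHW
w = torch.randn(4, 3, 3, 3, device=device)   # conv weight
b = torch.randn(4, device=device)            # bias (added via add_ in the trace)

# Profiling this call with e.g. `nsys profile python repro.py` is what
# produced the stack above on my setup.
y = F.conv2d(x, w, b, padding=1)
print(y.shape)  # torch.Size([1, 4, 8, 8])
```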
In past versions of PyTorch, the `add` op was implemented in the file `BinaryAddSubKernel.cu` under `ATen/native/cuda` and registered as `add_kernel_cuda`. However, this file no longer exists in more recent versions of PyTorch (>= 2.1). The `native_functions.yaml` entry for `add` is:
```yaml
- func: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  device_check: NoCheck   # TensorIterator
  structured_delegate: add.out
  variants: function, method
  dispatch:
    SparseCPU, SparseCUDA: add_sparse
    SparseCsrCPU, SparseCsrCUDA: add_sparse_csr
    MkldnnCPU: mkldnn_add
    ZeroTensor: add_zerotensor
    NestedTensorCPU, NestedTensorCUDA: NestedTensor_add_Tensor
  tags: [core, pointwise]
```
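To frame the question a bit more: my understanding (which may be wrong) is that `structured_delegate: add.out` means `add.Tensor` has no dense CPU/CUDA kernel of its own, and the dispatcher instead routes through the structured `add.out` entry. Quoting that entry from memory, so the details may differ by version, it looks roughly like:

```yaml
- func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  ufunc_inner_loop:
    Generic: add (AllAndComplex, BFloat16, Half, ComplexHalf)
    ScalarOnly: add (Bool)
  # (sparse/mkldnn dispatch entries omitted)
```

If that is accurate, the `ufunc_inner_loop` field suggests the dense kernel is now code-generated rather than hand-written, which might explain why `BinaryAddSubKernel.cu` disappeared. But I would appreciate confirmation from someone who knows the codegen.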
Any pointers to where the kernel now lives, or thoughts on the rationale behind the refactoring / re-implementation, would be greatly appreciated!