Hello,

PyTorch has highly optimized matrix operations when it comes to the standard operations

such as torch.matmul, torch.sparse.mm, …
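
For reference, a minimal sketch of the calls I mean (CPU tensors are used here just for illustration; on a GPU these same calls dispatch to CUDA kernels):

```python
import torch

a = torch.randn(4, 4)
b = torch.randn(4, 4)

# Dense matmul -- dispatched to an optimized backend kernel
c = torch.matmul(a, b)

# Sparse-dense product via torch.sparse.mm
s = a.to_sparse()
d = torch.sparse.mm(s, b)
```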

But where can I find the most fundamental CUDA kernels behind these algorithms if I want to contribute to their development?

I have been looking through the **PyTorch** packages as well as the **ATen** library, but was not able to find them in, e.g., this form:

```
// CUDA kernel to add the elements of two arrays on the GPU
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}
```

(Example taken from *An Even Easier Introduction to CUDA*, NVIDIA Technical Blog)