Is there a tutorial for writing a custom function in C++ with CUDA?

Hello. I need to implement a custom op on feature maps, such as a filter. With PyTorch's built-in functions, the implementation has to stack several ops, which makes it memory-inefficient. I therefore want to write a custom function in C++ and CUDA. Can anyone give me a hint about the pipeline for writing a custom function in C++?
Or can I manipulate a tensor or Variable directly in C++? I need to write the CUDA code in C++.

You can find an example of how to write C/CUDA code that interfaces with PyTorch here.

A real-world example can be found in this Faster R-CNN project.

I just read the C source code in the src directory and found that it calls functions from <THC/THC.h>. I wonder whether I can manipulate the tensor with CUDA parallel programming. I need to implement a filter over the feature map, and using im2col with tensor operations takes up too much GPU memory. If I can manipulate the tensor directly in a CUDA kernel, then I can implement the filter operation efficiently. Thanks!

I have another question. In the C code, the backward function returns only one tensor. Why doesn't it return two tensors, given that the add function takes two tensors as input? Thanks!

Yes, you can use arbitrary CUDA code. Check, for example, the Faster R-CNN code I sent you: they define CUDA kernels in there. You can also use Thrust or any other CUDA library you want, because you can get pointers to the underlying CUDA memory.
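
To illustrate the point about working on the raw memory directly, here is a minimal sketch of a filter that computes each output element straight from the input buffer, so no im2col buffer is ever allocated. It is plain C++ on a host pointer so that it compiles without a GPU. The names `filter_at` and `filter2d`, the single-channel row-major layout, and the "valid" (no padding) border handling are all assumptions made for this example, not something taken from the Faster R-CNN code. In a CUDA version, the body of `filter_at` is roughly what one thread would run for its own (y, x), and the two outer loops would be replaced by the thread grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one output element of a k x k filter, read directly
// from a row-major input of width W. No intermediate im2col buffer is built.
// In a CUDA port, this is roughly the work of a single thread.
float filter_at(const float* in, int W,
                const float* kernel, int k, int y, int x) {
    float acc = 0.0f;
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < k; ++j)
            acc += in[(y + i) * W + (x + j)] * kernel[i * k + j];
    return acc;
}

// Hypothetical driver: "valid" cross-correlation over an H x W input.
// The output has shape (H - k + 1) x (W - k + 1).
std::vector<float> filter2d(const std::vector<float>& in, int H, int W,
                            const std::vector<float>& kernel, int k) {
    assert(H >= k && W >= k);
    assert(in.size() == static_cast<std::size_t>(H) * W);
    assert(kernel.size() == static_cast<std::size_t>(k) * k);
    const int outH = H - k + 1, outW = W - k + 1;
    std::vector<float> out(static_cast<std::size_t>(outH) * outW);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            out[y * outW + x] = filter_at(in.data(), W, kernel.data(), k, y, x);
    return out;
}
```

The only extra memory here is the output buffer, whereas im2col materializes a matrix on the order of k*k times the size of the input.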

The add function is just an example; it is there mostly to illustrate how to use cffi with PyTorch.
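
On the backward question in general: for an element-wise sum out = a + b, the gradient with respect to each input equals the incoming gradient. A backward pass for a two-input function would therefore normally produce one gradient per input, and for addition the two happen to be identical. The sketch below only illustrates that arithmetic in plain C++. `AddGrads`, `add_forward`, and `add_backward` are hypothetical names, and this is not the code from the linked example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: one gradient per input of out = a + b.
struct AddGrads {
    std::vector<float> grad_a;
    std::vector<float> grad_b;
};

// Forward pass: element-wise sum of two equally sized buffers.
std::vector<float> add_forward(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Backward pass: d(out)/d(a) = 1 and d(out)/d(b) = 1 element-wise, so by the
// chain rule both input gradients are copies of the upstream gradient.
AddGrads add_backward(const std::vector<float>& grad_out) {
    return AddGrads{grad_out, grad_out};
}
```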

Hi. I've implemented my function in PyTorch with CUDA. Can I profile it using nvvp with PyTorch, or do I have to write timing code manually?

Yes, you can use nvvp, and it will work.
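
For the manual-timing alternative mentioned in the question, a minimal host-side sketch might look like the following. `time_ms` is a hypothetical helper, not a PyTorch or CUDA API. One caveat matters for GPU code: kernel launches are asynchronous, so the timed callable would need to wait for the device (for example with `cudaDeviceSynchronize()`) before returning, otherwise only the launch overhead is measured.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average wall-clock milliseconds per call of fn.
// A few warm-up calls are made first so one-time setup cost is excluded.
template <typename F>
double time_ms(F&& fn, int warmup, int iters) {
    assert(iters > 0);
    for (int i = 0; i < warmup; ++i) fn();
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iters;
}
```

A call such as `time_ms(run_my_op, 3, 20)` (with `run_my_op` standing in for whatever launches the op and synchronizes) would give an average per-call time to compare against what nvvp reports.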